Abstract

A $k$-modal probability distribution over the domain $\{1,\ldots,n\}$ is one whose histogram has at most $k$
"peaks" and "valleys." Such distributions are natural generalizations of monotone ($k = 0$) and unimodal
($k = 1$) probability distributions, which have been intensively studied in probability theory and statistics.

In this paper we consider the problem of learning an unknown $k$-modal distribution. The learning
algorithm is given access to independent samples drawn from the $k$-modal distribution $p$, and must output a
hypothesis distribution $\hat{p}$ such that with high probability the total variation distance between $p$ and $\hat{p}$ is at
most $\epsilon$.



We give an efficient algorithm for this problem that runs in time $\mathrm{poly}(k, \log(n), 1/\epsilon)$. For $k \le \tilde{O}(\sqrt{\log n})$,
the number of samples used by our algorithm is very close (within an $\tilde{O}(\log(1/\epsilon))$ factor) to being
information-theoretically optimal. Prior to this work computationally efficient algorithms were known only for the cases
$k = 0, 1$.

A novel feature of our approach is that our learning algorithm crucially uses a new property testing
algorithm as a key subroutine. The learning algorithm uses the property tester to efficiently decompose the
$k$-modal distribution into $k$ (near)-monotone distributions, which are easier to learn.
1 Introduction

This paper considers a natural unsupervised learning problem involving $k$-modal distributions over the discrete
domain $\{1, \ldots, n\}$. A distribution is $k$-modal if the plot of its probability density function (pdf) has at most
$k$ "peaks" and "valleys" (see Section 2.1 for a precise definition). Such distributions arise both in theoretical
(see e.g. [CKC83, Kem91, CT04]) and applied (see e.g. [Mur64, dTF90, FPP+98]) research; they naturally
generalize the simpler classes of monotone ($k = 0$) and unimodal ($k = 1$) distributions that have been intensively
studied in probability theory and statistics (see the discussion of related work below).

Our main aim in this paper is to give an efficient algorithm for learning an unknown $k$-modal distribution $p$
to total variation distance $\epsilon$, given access only to independent samples drawn from $p$. As described below there
is an information-theoretic lower bound of $\Omega(k \log(n/k)/\epsilon^3)$ samples for this learning problem, so an important
goal for us is to obtain an algorithm whose sample complexity is as close as possible to this lower bound (and
of course we want our algorithm to be computationally efficient, i.e. to run in time polynomial in the size of
its input sample). Our main contribution in this paper is a computationally efficient algorithm that has nearly
optimal sample complexity for small (but super-constant) values of $k$.

1.1 Background and related work 

There is a rich body of work in the statistics and probability literatures on estimating distributions under various
kinds of "shape" or "order" restrictions. In particular, many researchers have studied the risk of different
estimators for monotone and unimodal distributions; see for example the works of [Rao69, Weg70, Gro85, Bir87a,
Bir87b, Bir97], among many others. In the language of computational learning theory, these and related papers
from the probability/statistics literature mostly deal with information-theoretic upper and lower bounds on the
sample complexity of learning monotone and unimodal distributions. It should be noted that some of these
works do give computationally efficient algorithms for the cases $k = 0$ and $k = 1$; in particular we mention the
result of Birgé [Bir87b], which gives a computationally efficient $O(\log(n)/\epsilon^3)$-sample algorithm for learning
any unknown monotone distribution over $[n]$. (Birgé [Bir87a] also showed that this sample complexity is
asymptotically optimal, as we discuss below; we describe Birgé's algorithm in more detail in Section 2.2 and indeed
use it as an ingredient of our approach throughout this paper.) However, for these relatively simple $k = 0, 1$
classes of distributions the main challenge is in developing sample-efficient estimators, and the algorithmic
aspects are typically rather straightforward (as is the case in [Bir87b]). In contrast, much more challenging and
interesting algorithmic issues arise for the general values of $k$ which we consider in this paper.

1.2 Our Results 

Our main result is a highly efficient algorithm for learning an unknown $k$-modal distribution over $[n]$:

Theorem 1 Let $p$ be any unknown $k$-modal distribution over $[n]$. There is an algorithm that uses^1

$$\left(\frac{k\log(n/k)}{\epsilon^3} + \frac{k^3}{\epsilon^3}\cdot\log\frac{k}{\epsilon}\cdot\log\log\frac{k}{\epsilon}\right)\cdot\tilde{O}(\log(1/\delta))$$

samples from $p$, runs for $\mathrm{poly}(k, \log n, 1/\epsilon, \log(1/\delta))$ bit-operations, and with probability $1 - \delta$ outputs a
(succinct description of a) hypothesis distribution $\hat{p}$ over $[n]$ such that the total variation distance between $p$ and
$\hat{p}$ is at most $\epsilon$.

As alluded to earlier, Birgé [Bir87a] gave a sample complexity lower bound for learning monotone
distributions. The lower bound in [Bir87a] is stated for continuous distributions but the arguments are easily adapted
to the discrete case; [Bir87a] shows that (for $\epsilon \ge 1/n^{\Omega(1)}$)^2 any algorithm for learning an unknown monotone
distribution over $[n]$ to total variation distance $\epsilon$ must use $\Omega(\log(n)/\epsilon^3)$ samples. By a simple construction
which concatenates $k$ copies of the monotone lower bound construction over intervals of length $n/k$, using the
monotone lower bound it is possible to show:

Proposition 1 Any algorithm for learning an unknown $k$-modal distribution over $[n]$ to variation distance $\epsilon$ (for
$\epsilon \ge 1/n^{\Omega(1)}$) must use $\Omega(k \log(n/k)/\epsilon^3)$ samples.

^1 We write $\tilde{O}(\cdot)$ to hide factors which are polylogarithmic in the argument to $\tilde{O}(\cdot)$; thus for example $\tilde{O}(a \log b)$ denotes a quantity
which is $O((a \log b) \cdot (\log(a \log b))^c)$ for some absolute constant $c$.

^2 For $\epsilon$ sufficiently small the generic upper bound, which says that any distribution over $[n]$ can be learned to variation
distance $\epsilon$ using $O(n/\epsilon^2)$ samples, provides a better bound.

Thus our learning algorithm is nearly optimally efficient in its sample complexity; more precisely, for
$k \le \tilde{O}(\sqrt{\log n})$ (and $\epsilon$ as bounded above), our sample complexity in Theorem 1 is asymptotically optimal up to a
factor of $\tilde{O}(\log(1/\epsilon))$. Since each draw from a distribution over $[n]$ is a $\log(n)$-bit string, Proposition 1 implies
that the running time of our algorithm is optimal up to polynomial factors. We note that to the best of our
knowledge, prior to this work no learning algorithm for $k$-modal distributions was known that even had running
time fixed polynomial in $n$.

1.3 Our Approach 

As mentioned in Section 1.1, Birgé gave a highly efficient algorithm for learning a monotone distribution in
[Bir87b]. Since a $k$-modal distribution is simply a concatenation of $k + 1$ monotone distributions (first
non-increasing, then non-decreasing, then non-increasing, etc.), it is natural to try to use Birgé's algorithm as a
component of an algorithm for learning $k$-modal distributions, and indeed this is what we do.

The most naive way to use Birgé's algorithm would be to guess all possible $\binom{n}{k}$ locations of the $k$ "modes"
of $p$. While such an approach can be shown to have good sample complexity, the resulting $\Omega(n^k)$ running time
is grossly inefficient. A "moderately naive" approach, which we analyze in Section 3.1, is to partition $[n]$ into
roughly $k/\epsilon$ intervals each of weight roughly $\epsilon/k$, and run Birgé's algorithm separately on each such interval.
Since the target distribution is $k$-modal, at most $k$ of the intervals can be non-monotone; Birgé's algorithm can
be used to obtain an $\epsilon$-accurate hypothesis on each monotone interval, and even if it fails badly on the (at most)
$k$ non-monotone intervals, the resulting total contribution towards the overall error from those failures is at most
$O(\epsilon)$. This approach is much more efficient than the totally naive approach, giving running time polynomial in
$k$, $\log n$, and $1/\epsilon$, but its sample complexity turns out to be polynomially worse than the $O(k \log(n)/\epsilon^3)$ that we
are aiming for. (Roughly speaking, this is because the approach involves running Birgé's $O(\log(n)/\epsilon^3)$-sample
algorithm $k/\epsilon$ times, so it uses at least $k \log(n)/\epsilon^4$ samples.)

Our main learning result is achieved by augmenting the "moderately naive" algorithm sketched above with a
new property testing algorithm. We give a property testing algorithm for the following problem: given samples
from a $k$-modal distribution $p$, output "yes" if $p$ is monotone and "no" if $p$ is $\epsilon$-far from every monotone
distribution. Crucially, our testing algorithm uses $O(k^2/\epsilon^2)$ samples, independent of $n$, for this problem. Roughly
speaking, by using this algorithm $O(k/\epsilon)$ times we are able to identify $k + 1$ intervals that (i) collectively contain
almost all of $p$'s mass, and (ii) are each (close to) monotone and thus can be handled using Birgé's algorithm.
Thus the overall sample complexity of our approach is (roughly) $(k/\epsilon)^3$ (for the $k/\epsilon$ runs of the tester) plus
$k \log(n)/\epsilon^3$ (for the $k$ runs of Birgé's algorithm), which gives Theorem 1 and is very close to optimal for $k$ not
too large.
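
As a rough sanity check, the two sample counts discussed in this subsection can be written out side by side. This is only a back-of-the-envelope sketch: constants and logarithmic factors are ignored, and the per-run costs are the ones quoted in the text ($O(\log(n)/\epsilon^3)$ per run of Birgé's learner, $O(k^2/\epsilon^2)$ per run of the tester).

```latex
% "Moderately naive" approach: one run of Birge's learner on each of the k/eps intervals.
\underbrace{\frac{k}{\epsilon}}_{\text{intervals}} \cdot
\underbrace{\frac{\log n}{\epsilon^{3}}}_{\text{samples per run}}
  = \frac{k \log n}{\epsilon^{4}}

% Tester-based approach: k/eps runs of the tester, then k+1 runs of Birge's learner.
\frac{k}{\epsilon} \cdot \frac{k^{2}}{\epsilon^{2}}
  + (k+1) \cdot \frac{\log n}{\epsilon^{3}}
  = O\!\left(\frac{k^{3}}{\epsilon^{3}} + \frac{k \log n}{\epsilon^{3}}\right)
```

The second count saves a factor of $1/\epsilon$ on the $\log n$ term, at the price of an additive term that does not depend on $n$.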

1.4 Discussion 

Our learning algorithm highlights a novel way that property testing algorithms can be useful for learning. Much
research has been done on understanding the relation between property testing algorithms and learning
algorithms, see e.g. [GGR98, KR00] and the lengthy survey [Ron08]. As Goldreich has noted [Gol11], an
often-invoked motivation for property testing is that (inexpensive) testing algorithms can be used as a "preliminary
diagnostic" to determine whether it is appropriate to run a (more expensive) learning algorithm. In contrast, in
this work we are using property testing rather differently, as an inexpensive way of decomposing a "complex"
object (a $k$-modal distribution) which we do not a priori know how to learn, into a collection of "simpler"
objects (monotone or near-monotone distributions) which can be learned using existing techniques. We are not
aware of prior learning algorithms that successfully use property testers in this way; we believe that this
high-level approach to designing learning algorithms, by using property testers to decompose "complex" objects into
simpler objects that can be efficiently learned, may find future applications elsewhere.

2 Preliminaries 

2.1 Notation and Problem Statement 

For $n \in \mathbb{Z}_+$, denote by $[n]$ the set $\{1, \ldots, n\}$; for $i, j \in \mathbb{Z}_+$, $i \le j$, denote by $[i, j]$ the set $\{i, i + 1, \ldots, j\}$. For
$v = (v(1), \ldots, v(n)) \in \mathbb{R}^n$ denote by $\|v\|_1 = \sum_{i=1}^{n} |v(i)|$ its $L_1$-norm.

We consider discrete probability distributions over $[n]$, which are functions $p : [n] \to [0,1]$ such that
$\sum_{i=1}^{n} p(i) = 1$. For $S \subseteq [n]$ we write $p(S)$ to denote $\sum_{i \in S} p(i)$. For $S \subseteq [n]$, we write $p_S$ to denote the
conditional distribution over $S$ that is induced by $p$. We use the notation $P$ for the cumulative distribution
function (cdf) corresponding to $p$, i.e. $P : [n] \to [0, 1]$ is defined by $P(j) = \sum_{i=1}^{j} p(i)$.

A distribution $p$ over $[n]$ is non-increasing (resp. non-decreasing) if $p(i + 1) \le p(i)$ (resp. $p(i + 1) \ge p(i)$),
for all $i \in [n - 1]$; $p$ is monotone if it is either non-increasing or non-decreasing. We call a nonempty interval
$I = [a,b] \subseteq [2, n - 1]$ a max-interval of $p$ if $p(i) = c$ for all $i \in I$ and $\max\{p(a - 1), p(b + 1)\} < c$;
in this case, we say that the point $a$ is a left max point of $p$. Analogously, a min-interval of $p$ is an interval
$I = [a, b] \subseteq [2, n - 1]$ with $p(i) = c$ for all $i \in I$ and $\min\{p(a - 1), p(b + 1)\} > c$; the point $a$ is called a
left min point of $p$. If $I = [a, b]$ is either a max-interval or a min-interval (it cannot be both) we say that $I$ is
an extreme-interval of $p$, and $a$ is called a left extreme point of $p$. Note that any distribution uniquely defines
a collection of extreme-intervals (hence, left extreme points). We say that $p$ is $k$-modal if it has at most $k$
extreme-intervals.
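
The definition above can be checked mechanically for a distribution given explicitly as a list of probabilities. The following is a minimal sketch, not part of the paper: the function names `count_extreme_intervals` and `is_k_modal` are hypothetical, the list `p` is assumed to hold $p(1), \ldots, p(n)$ in order, and equality of probabilities is tested exactly (so inputs with floating-point noise would need a tolerance).

```python
from itertools import groupby


def count_extreme_intervals(p):
    """Count the max-intervals and min-intervals of a pmf given as a list."""
    # Collapse each maximal run of equal values into a single plateau value.
    plateaus = [value for value, _ in groupby(p)]
    count = 0
    # Only interior plateaus qualify: the first and last plateau contain the
    # points 1 and n, so they are not contained in [2, n-1].
    for left, mid, right in zip(plateaus, plateaus[1:], plateaus[2:]):
        if mid > max(left, right):      # plateau strictly above both neighbours
            count += 1
        elif mid < min(left, right):    # plateau strictly below both neighbours
            count += 1
    return count


def is_k_modal(p, k):
    """Return True if p has at most k extreme-intervals."""
    return count_extreme_intervals(p) <= k
```

For instance, a non-increasing list has no extreme-intervals, while `[0.1, 0.4, 0.4, 0.1]` has exactly one (the plateau of height 0.4).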

Let $p, q$ be distributions over $[n]$ with corresponding cdfs $P, Q$. The total variation distance between $p$ and
$q$ is $d_{TV}(p, q) := \max_{S \subseteq [n]} |p(S) - q(S)| = (1/2) \cdot \|p - q\|_1$. The Kolmogorov distance between $p$ and $q$ is
defined as $d_K(p, q) := \max_{j \in [n]} |P(j) - Q(j)|$. Note that $d_K(p, q) \le d_{TV}(p, q)$.
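
Both distances are straightforward to compute for explicitly given distributions. The sketch below is illustrative only; the function names are hypothetical and the inputs are assumed to be lists of equal length holding $p(1), \ldots, p(n)$ and $q(1), \ldots, q(n)$.

```python
from itertools import accumulate


def tv_distance(p, q):
    """Total variation distance: half the L1 distance between the pmfs."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kolmogorov_distance(p, q):
    """Kolmogorov distance: largest gap between the two cdfs."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))
```

With `p = [0.5, 0.0, 0.5]` and `q = [0.0, 1.0, 0.0]` the supports are disjoint, so the total variation distance is 1, whereas the cdfs never differ by more than 0.5; this illustrates that the inequality $d_K \le d_{TV}$ can be strict.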

Learning $k$-modal Distributions. Given independent samples from an unknown $k$-modal distribution $p \in \mathcal{M}_n^k$
and $\epsilon > 0$, the goal is to output a hypothesis distribution $h$ such that with probability $1 - \delta$ we have $d_{TV}(p, h) \le \epsilon$.
We say that such an algorithm $A$ learns $p$ to accuracy $\epsilon$ and confidence $\delta$. The parameters of interest are the
number of samples and the running time required by the algorithm.

2.2 Basic Tools 

We will need three tools from probability theory. 

Our first tool says that $O(1/\epsilon^2)$ samples suffice to learn any distribution within error $\epsilon$ with respect to the
Kolmogorov distance. This fundamental fact is known as the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality
([DKW56]). Given $m$ independent samples $s_1, \ldots, s_m$, drawn from $p : [n] \to [0, 1]$, the empirical distribution
$\hat{p}_m : [n] \to [0, 1]$ is defined as follows: for all $i \in [n]$, $\hat{p}_m(i) = |\{j \in [m] \mid s_j = i\}|/m$. The DKW inequality
states that for $m = \Omega((1/\epsilon^2) \cdot \ln(1/\delta))$, with probability $1 - \delta$ the empirical distribution $\hat{p}_m$ will be $\epsilon$-close to $p$
in Kolmogorov distance. This sample bound is asymptotically optimal and independent of the support size.



Theorem 2 ([DKW56, Mas90]) For all $\epsilon > 0$, it holds: $\Pr[d_K(p, \hat{p}_m) > \epsilon] \le 2e^{-2m\epsilon^2}$.
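
Solving $2e^{-2m\epsilon^2} \le \delta$ for $m$ gives $m \ge \ln(2/\delta)/(2\epsilon^2)$, which makes the sample bound concrete. The sketch below computes this sample size and the empirical distribution; the function names are hypothetical and samples are assumed to be integers in $\{1, \ldots, n\}$.

```python
import math
from collections import Counter


def dkw_sample_size(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def empirical_distribution(samples, n):
    """Empirical pmf over {1, ..., n}: fraction of samples equal to each point."""
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(i, 0) / m for i in range(1, n + 1)]
```

Note that the resulting sample size depends only on $\epsilon$ and $\delta$, not on $n$, matching the remark that the bound is independent of the support size.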



Our second tool, due to Birgé [Bir87b], provides a sample-optimal and computationally efficient algorithm
to learn monotone distributions in total variation distance. Before we state the relevant theorem, we need a
definition. We say that a distribution $p$ is $\delta$-close to being non-increasing (resp. non-decreasing) if there exists a
non-increasing (resp. non-decreasing) distribution $q$ such that $d_{TV}(p, q) \le \delta$. We are now ready to state Birgé's
result:

Theorem 3 ([Bir87b], Theorem 1) (semi-agnostic learner) There is an algorithm $L^{\downarrow}$ with the following
performance guarantee: Given $m$ independent samples from a distribution $p$ over $[n]$ which is opt-close to being
non-increasing, $L^{\downarrow}$ performs $\tilde{O}(m \cdot \log n + m^{1/3} \cdot (\log n)^{5/3})$ bit-operations and outputs a (succinct description
of a) hypothesis distribution $\tilde{p}$ over $[n]$ that satisfies

$$\mathbb{E}[d_{TV}(\tilde{p}, p)] \le 2 \cdot \mathrm{opt} + O\big((\log n/(m + 1))^{1/3}\big).$$



The aforementioned algorithm partitions the domain $[n]$ into $O(m^{1/3} \cdot (\log n)^{2/3})$ intervals and outputs a
hypothesis distribution that is uniform within each of these intervals.
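
To give a feel for a hypothesis that is "uniform within each interval", here is a simplified sketch in the spirit of such a construction. It is an assumption-laden illustration, not Birgé's algorithm: the geometric growth rule `(1 + eps) ** i` for the interval lengths, the parameter `eps`, and the function names are all hypothetical choices, and the exact interval lengths and constants in [Bir87b] may differ.

```python
def geometric_partition(n, eps):
    """Split {1, ..., n} into consecutive intervals with lengths ~ (1 + eps)^i."""
    intervals, start, i = [], 1, 0
    while start <= n:
        length = max(1, int((1.0 + eps) ** i))   # assumed growth rule
        end = min(n, start + length - 1)
        intervals.append((start, end))
        start, i = end + 1, i + 1
    return intervals


def flatten(p, intervals):
    """Spread the mass of each interval uniformly over its points."""
    q = [0.0] * len(p)
    for a, b in intervals:
        mass = sum(p[a - 1:b])                   # p is 0-indexed, intervals 1-indexed
        for i in range(a, b + 1):
            q[i - 1] = mass / (b - a + 1)
    return q
```

Applied to an empirical distribution, `flatten` yields a hypothesis with a succinct description: one number per interval. Because the interval lengths grow geometrically, the number of intervals is only logarithmic in $n$ for fixed `eps`.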

By taking $m = \Omega(\log n/\epsilon^3)$, one obtains a hypothesis such that $\mathbb{E}[d_{TV}(\tilde{p}, p)] \le 2 \cdot \mathrm{opt} + \epsilon$. We stress
that Birgé's algorithm for learning non-increasing distributions [Bir87b] is in fact "semi-agnostic", in the sense
that it also learns distributions that are close to being non-increasing; this robustness will be crucial for us later
(since in our final algorithm we will use Birgé's algorithm on distributions identified by our tester, that are
close to monotone but not necessarily perfectly monotone). This semi-agnostic property is not explicitly stated
in [Bir87b] but it can be shown to follow easily from his results. We show how the semi-agnostic property
follows from Birgé's results in Appendix A. Let $L^{\uparrow}$ denote the corresponding semi-agnostic algorithm for
learning non-decreasing distributions.

Our final tool is a routine to do hypothesis testing, i.e. to select a high-accuracy hypothesis distribution
from a collection of hypothesis distributions one of which has high accuracy. The need for such a routine
arises in several places; in some cases we know that a distribution is monotone, but do not know whether it is
non-increasing or non-decreasing. In this case, we can run both algorithms $L^{\uparrow}$ and $L^{\downarrow}$ and then choose a good
hypothesis using hypothesis testing. Another need for hypothesis testing is to "boost confidence" that a learning
algorithm generates a high-accuracy hypothesis. Our initial version of the algorithm for Theorem 1 generates an
$\epsilon$-accurate hypothesis with probability at least 9/10; by running it $O(\log(1/\delta))$ times using a hypothesis testing
routine, it is possible to identify an $O(\epsilon)$-accurate hypothesis with probability $1 - \delta$. Routines of the sort that we
require have been given in e.g. [DL01] and [DDS11]; we use the following theorem from [DDS11]:

Theorem 4 There is an algorithm Choose-Hypothesis$^p(h_1, h_2, \epsilon', \delta')$ which is given oracle access to $p$,
two hypothesis distributions $h_1, h_2$ for $p$, an accuracy parameter $\epsilon'$, and a confidence parameter $\delta'$. It makes
$m = O(\log(1/\delta')/\epsilon'^2)$ draws from $p$ and returns a hypothesis $h \in \{h_1, h_2\}$. If one of $h_1, h_2$ has $d_{TV}(h_i, p) \le \epsilon'$
then with probability $1 - \delta'$ the hypothesis $h$ that Choose-Hypothesis returns has $d_{TV}(h, p) \le 6\epsilon'$.

For the sake of completeness, we describe and analyze the Choose-Hypothesis algorithm in Appendix B.
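
To illustrate the kind of routine Theorem 4 refers to, the following is a Scheffé-style sketch of a two-hypothesis competition. It is an assumption-laden illustration rather than the routine analyzed in Appendix B: the thresholds `5 * eps` and `1.5 * eps`, the tie-breaking rule, the function name, and the sampling interface `draw` (a zero-argument callable returning a point of $\{1, \ldots, n\}$ distributed according to $p$) are all hypothetical.

```python
def choose_hypothesis(draw, h1, h2, eps, m):
    """Pick one of two pmfs (lists over {1, ..., n}) using m draws from p."""
    n = len(h1)
    # W is the set of points where h1 puts strictly more mass than h2.
    W = {i for i in range(1, n + 1) if h1[i - 1] > h2[i - 1]}
    p1 = sum(h1[i - 1] for i in W)       # h1(W)
    p2 = sum(h2[i - 1] for i in W)       # h2(W); note d_TV(h1, h2) = p1 - p2
    if p1 - p2 <= 5 * eps:
        return h1                         # hypotheses are close: either will do
    # Estimate p(W) by the fraction of samples that land in W.
    tau = sum(1 for _ in range(m) if draw() in W) / m
    if tau > p1 - 1.5 * eps:
        return h1                         # p(W) looks like h1(W)
    if tau < p2 + 1.5 * eps:
        return h2                         # p(W) looks like h2(W)
    return h1                             # neither fits; arbitrary choice
```

The design idea is that $W$ is the single event on which the two hypotheses disagree the most, so estimating $p(W)$ to within a small multiple of `eps` is enough to rule out a hypothesis that is far from $p$ when the other one is close.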

3 Learning $k$-modal Distributions

In this section, we present our main result: a nearly sample-optimal and computationally efficient algorithm to
learn an unknown $k$-modal distribution. In Section 3.1 we present a simple learning algorithm with a suboptimal
sample complexity. In Section 3.2 we present our main result which involves a property testing algorithm as a
subroutine.

3.1 Warm-up: A simple learning algorithm. 

In this subsection, we give an algorithm that runs in time $\mathrm{poly}(k, \log n, 1/\epsilon, \log(1/\delta))$ and learns an unknown
$k$-modal distribution to accuracy $\epsilon$ and confidence $\delta$. The sample complexity of the algorithm is suboptimal as a
function of $\epsilon$, by a polynomial factor.

In the following figure we give the algorithm Learn-kmodal-simple which produces an $\epsilon$-accurate
hypothesis with confidence 9/10 (see Theorem 5). We explain how to boost the confidence to $1 - \delta$ after the
proof of the theorem.



Learn-kmodal-simple

Inputs: $\epsilon > 0$; sample access to $k$-modal distribution $p$ over $[n]$

1. Fix $\tau := \epsilon^2/(100k)$. Draw $r = \Theta(1/\tau^2)$ samples from $p$ and let $\hat{p}$ denote the resulting empirical
distribution.

2. Greedily partition the domain $[n]$ into $\ell$ atomic intervals $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$ as follows: $I_1 := [1, j_1]$,
where $j_1 := \min\{j \in [n] \mid \hat{p}([1, j]) \ge \epsilon/(10k)\}$. For $i \ge 1$, if $\cup_{j=1}^{i} I_j = [1, j_i]$, then $I_{i+1} :=
[j_i + 1, j_{i+1}]$, where $j_{i+1}$ is defined as follows: if $\hat{p}([j_i + 1, n]) \ge \epsilon/(10k)$, then $j_{i+1} := \min\{j \in
[n] \mid \hat{p}([j_i + 1, j]) \ge \epsilon/(10k)\}$; otherwise, $j_{i+1} := n$.

3. Construct a set of $\ell$ light intervals $\mathcal{I}' := \{I'_i\}_{i=1}^{\ell}$ and a set $\{b_i\}_{i=1}^{t}$ of $t \le \ell$ heavy points as follows:
for each interval $I_i = [a, b] \in \mathcal{I}$, if $\hat{p}(I_i) \ge \epsilon/(5k)$ define $I'_i := [a, b - 1]$ and make $b$ a heavy
point. (Note that it is possible to have $I'_i = \emptyset$.) Otherwise, define $I'_i := I_i$.

Fix $\delta' := \epsilon/(500k)$.

4. Draw $m = (k/\epsilon^4) \cdot \log(n) \cdot \tilde{\Theta}(\log(1/\delta'))$ samples $s = \{s_i\}_{i=1}^{m}$ from $p$. For each light interval $I'_i$,
$i \in [\ell]$, run both $L^{\downarrow}_{\delta'}$ and $L^{\uparrow}_{\delta'}$ on the conditional distribution $p_{I'_i}$ using the samples in $s \cap I'_i$. Let
$\tilde{p}^{\downarrow}_{I'_i}$, $\tilde{p}^{\uparrow}_{I'_i}$ be the corresponding conditional hypothesis distributions.

5. Draw $m' = \Theta((k/\epsilon^4) \cdot \log(1/\delta'))$ samples $s' = \{s'_i\}_{i=1}^{m'}$ from $p$. For each light interval $I'_i$,
$i \in [\ell]$, run Choose-Hypothesis$^p(\tilde{p}^{\uparrow}_{I'_i}, \tilde{p}^{\downarrow}_{I'_i}, \epsilon, \delta')$ using the samples in $s' \cap I'_i$. Denote by $\tilde{p}_{I'_i}$
the returned conditional distribution on $I'_i$.

6. Output the hypothesis $h = \sum_{j=1}^{\ell} \hat{p}(I'_j) \cdot \tilde{p}_{I'_j} + \sum_{j=1}^{t} \hat{p}(b_j) \cdot \mathbf{1}_{b_j}$.
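
Steps 2 and 3 above are purely combinatorial once the empirical distribution is known, and can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, `p_hat` is assumed to be a list holding $\hat{p}(1), \ldots, \hat{p}(n)$, and the two thresholds are passed in explicitly (in the algorithm they are $\epsilon/(10k)$ and $\epsilon/(5k)$).

```python
def atomic_intervals(p_hat, threshold):
    """Step 2: greedily cut {1, ..., n} into consecutive intervals, closing an
    interval as soon as its empirical mass reaches `threshold`.  The last
    interval may be lighter, since it is closed at n regardless."""
    n = len(p_hat)
    intervals, start, mass = [], 1, 0.0
    for j in range(1, n + 1):
        mass += p_hat[j - 1]
        if mass >= threshold or j == n:
            intervals.append((start, j))
            start, mass = j + 1, 0.0
    return intervals


def light_and_heavy(p_hat, intervals, heavy_threshold):
    """Step 3: if an atomic interval [a, b] has empirical mass at least
    `heavy_threshold`, split off its right endpoint b as a heavy point."""
    light, heavy = [], []
    for a, b in intervals:
        if sum(p_hat[a - 1:b]) >= heavy_threshold:
            light.append((a, b - 1))    # empty when a > b - 1
            heavy.append(b)
        else:
            light.append((a, b))
    return light, heavy
```

A heavy interval can only arise because its last point pushed the running mass well past the threshold, which is why removing that single point leaves a light remainder.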



The algorithm Learn-kmodal-simple works as follows: We start by partitioning the domain $[n]$ into
consecutive intervals of mass approximately $\epsilon/k$. To do this, we make use of the DKW inequality, with accuracy
parameter roughly $\epsilon^2/k$. (Some care is needed in this step, since there may be "heavy" points in the support
of the distribution; however, we gloss over this technical issue for the sake of this intuitive explanation.) If this
step is successful, we have partitioned the domain into a set of $O(k/\epsilon)$ consecutive intervals of probability mass
roughly $\epsilon/k$. Our next step is to apply Birgé's monotone learning algorithm to each interval.

A caveat comes from the fact that not all such intervals are guaranteed to be monotone (or even close to
being monotone). However, since our input distribution is assumed to be $k$-modal, all but (at most) $k$ of these
intervals are monotone. Call a non-monotone interval "bad". Since all intervals have probability mass at most
$\epsilon/k$ and there are at most $k$ bad intervals, these intervals contribute at most $\epsilon$ to the total mass. So even though
Birgé's algorithm gives no guarantees for bad intervals, these intervals do not affect the error by more than $\epsilon$.

Let us now focus on the monotone intervals. For each such interval, we do not know if it is monotone
increasing or monotone decreasing. To overcome this difficulty, we run both monotone algorithms $L^{\downarrow}$ and $L^{\uparrow}$
for each interval and then use hypothesis testing to choose the correct candidate distribution.

Also, note that since we have $k/\epsilon$ intervals, we need to run each instance of both the monotone learning
algorithms and the hypothesis testing algorithm with confidence $1 - O(\epsilon/k)$, so that we can guarantee that the
overall algorithm has confidence 9/10. Note that Theorem 3 and Markov's inequality imply that if we draw
$\Omega(\log n/\epsilon^3)$ samples from a non-increasing distribution $p$, the hypothesis $\tilde{p}$ output by $L^{\downarrow}$ satisfies $d_{TV}(\tilde{p}, p) \le \epsilon$
with probability 9/10. We can boost the confidence to $1 - \delta$ with an overhead of $O(\log(1/\delta) \log\log(1/\delta))$ in
the sample complexity:

Fact 2 Let $p$ be a non-increasing distribution over $[n]$. There is an algorithm $L^{\downarrow}_{\delta}$ with the following
performance guarantee: Given $(\log n/\epsilon^3) \cdot \tilde{O}(\log(1/\delta))$ samples from $p$, $L^{\downarrow}_{\delta}$ performs $\tilde{O}((\log^2 n/\epsilon^3) \cdot \log(1/\delta))$
bit-operations and outputs a (succinct description of a) hypothesis distribution $\tilde{p}$ over $[n]$ that satisfies $d_{TV}(\tilde{p}, p) \le \epsilon$
with probability at least $1 - \delta$.

The algorithm $L^{\downarrow}_{\delta}$ runs $L^{\downarrow}$ $O(\log(1/\delta))$ times and performs a tournament among the candidate
hypotheses using Choose-Hypothesis. Let $L^{\uparrow}_{\delta}$ denote the corresponding algorithm for learning non-decreasing
distributions with confidence $\delta$. We postpone further details on these algorithms to Appendix C.

Theorem 5 The algorithm Learn-kmodal-simple uses $\frac{k \log n}{\epsilon^4} \cdot \tilde{O}(\log(k/\epsilon)) + O(k^2/\epsilon^4)$ samples, performs
$\mathrm{poly}(k, \log n, 1/\epsilon)$ bit-operations, and learns a $k$-modal distribution to accuracy $O(\epsilon)$ with probability 9/10.

Proof: We first prove that with probability 9/10 (over its random samples), algorithm Learn-kmodal-simple
outputs a hypothesis $h$ such that $d_{TV}(h, p) \le O(\epsilon)$.

Since $r = \Theta(1/\tau^2)$ samples are drawn in Step 1, the DKW inequality implies that with probability of
failure at most 1/100, for each interval $I \subseteq [n]$ we have $|\hat{p}(I) - p(I)| \le 2\tau$. For the rest of the analysis of
Learn-kmodal-simple we condition on this "good" event.

Since every atomic interval $I \in \mathcal{I}$ has $\hat{p}(I) \ge \epsilon/(10k)$ (except potentially the rightmost one), it follows that
the number $\ell$ of atomic intervals constructed in Step 2 satisfies $\ell \le 10 \cdot (k/\epsilon)$. By the construction in Steps 2
and 3, every light interval $I' \in \mathcal{I}'$ has $\hat{p}(I') \le \epsilon/(5k)$, which implies $p(I') \le \epsilon/(5k) + 2\tau$. Note also that every
heavy point $b$ has $\hat{p}(b) \ge \epsilon/(10k)$ and the number of heavy points $t$ is at most $\ell$.

Since the light intervals and heavy points form a partition of $[n]$, we can write $p = \sum_{j=1}^{\ell} p(I'_j) \cdot p_{I'_j} +
\sum_{j=1}^{t} p(b_j) \cdot \mathbf{1}_{b_j}$. Therefore, we can bound the variation distance as follows:

$$d_{TV}(h, p) \le \sum_{j=1}^{\ell} |\hat{p}(I'_j) - p(I'_j)| + \sum_{j=1}^{t} |\hat{p}(b_j) - p(b_j)| + \sum_{j=1}^{\ell} p(I'_j) \cdot d_{TV}(\tilde{p}_{I'_j}, p_{I'_j}).$$

By the DKW inequality, each term in the first two sums is bounded from above by $2\tau$. Hence the contribution
of these terms to the sum is at most $2\tau \cdot (\ell + t) \le 4\tau \cdot \ell \le 2\epsilon/5$.
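
The last inequality can be checked directly from the parameter choices. The following spells out the arithmetic, assuming $\tau = \epsilon^2/(100k)$ from Step 1 and the bounds $t \le \ell \le 10k/\epsilon$ derived above.

```latex
2\tau(\ell + t) \;\le\; 4\tau\ell
  \;\le\; 4 \cdot \frac{\epsilon^{2}}{100k} \cdot \frac{10k}{\epsilon}
  \;=\; \frac{2\epsilon}{5}
```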

We proceed to bound the contribution of the third term. Since $p$ is $k$-modal, at most $k$ of the light intervals
$I'_j$ are not monotone for $p$. Call these intervals "bad". Even though we have not identified the bad intervals, we
know that all such intervals are light. Therefore, their total probability mass under $p$ is at most $k \cdot (\epsilon/(5k) + 2\tau)$.
This implies that the contribution of bad intervals to the third term of the variation distance is at most $\epsilon/4$.
(Note that this statement holds true independent of the samples $s$ we draw in Step 4.) It remains to bound the
contribution of monotone intervals to the third term.

Let $\ell' \le \ell$ be the number of monotone light intervals and assume after renaming the indices that they are
$\tilde{\mathcal{I}} := \{I'_i\}_{i=1}^{\ell'}$. To bound the variation distance, it suffices to show that with probability at least 19/20 (over the
samples drawn in Steps 4-5) it holds

$$\sum_{i=1}^{\ell'} p(I'_i) \cdot d_{TV}(\tilde{p}_{I'_i}, p_{I'_i}) = O(\epsilon) \qquad (1)$$

Note first that we do not have a lower bound on the probability mass of the monotone light intervals. We partition this set in
two subsets: the subset $\mathcal{I}'$ containing those intervals whose probability mass under $p$ is at most $\epsilon^2/(20k)$; and its
complement $\mathcal{I}''$. It is clear that the contribution of $\mathcal{I}'$ to the above expression can be at most $\ell \cdot \epsilon^2/(20k) \le \epsilon/2$.
We further partition the set $\mathcal{I}''$ of remaining intervals into $b = \lceil \log(5/\epsilon) \rceil$ groups. For $i \in [b]$, the set $(\mathcal{I}'')_i$
consists of those intervals in $\mathcal{I}''$ that have mass under $p$ in the range $[2^{-i} \cdot (\epsilon/5k), 2^{-i+1} \cdot (\epsilon/5k)]$. (Note that
these groups collectively cover all intervals in $\mathcal{I}''$, since each such interval has weight between $\epsilon^2/(20k)$ and
$\epsilon/(4k)$; recall that every light interval $I'$ satisfies $p(I') \le \epsilon/(5k) + 2\tau < \epsilon/(4k)$.) We have:

Claim 3 With probability at least 19/20 (over the samples $s, s'$), for each $i \in [b]$ and each monotone light
interval $I'_j \in (\mathcal{I}'')_i$ we have $d_{TV}(\tilde{p}_{I'_j}, p_{I'_j}) = O(2^{i/3} \cdot \epsilon)$.



Proof: Since in Step 4 we draw $m$ samples, and each interval $I'_j \in (\mathcal{I}'')_i$ has $p(I'_j) \in [2^{-i} \cdot (\epsilon/5k), 2^{-i+1} \cdot (\epsilon/5k)]$,
a standard coupon collector argument [NS60] tells us that with probability 99/100, for each $(i, j)$ pair, the
interval $I'_j$ will get at least $2^{-i} \cdot (\log(n)/\epsilon^3) \cdot \tilde{\Omega}(\log(1/\delta'))$ many samples. Let us rewrite this as
$(\log(n)/(2^{i/3} \cdot \epsilon)^3) \cdot \tilde{\Omega}(\log(1/\delta'))$ samples. We condition on this event.

Fix an interval $I'_j \in (\mathcal{I}'')_i$. We first show that with failure probability at most $\epsilon/(500k)$ after Step 4, either
$\tilde{p}^{\downarrow}_{I'_j}$ or $\tilde{p}^{\uparrow}_{I'_j}$ will be $(2^{i/3} \cdot \epsilon)$-accurate. Indeed, by Fact 2 and taking into account the number of samples that landed
in $I'_j$, with probability $1 - \epsilon/(500k)$ over $s$, $d_{TV}(\tilde{p}^{\alpha_j}_{I'_j}, p_{I'_j}) \le 2^{i/3}\epsilon$, where $\alpha_j = \downarrow$ if $p_{I'_j}$ is non-increasing and
$\alpha_j = \uparrow$ otherwise. By a union bound over all (at most $\ell$ many) $(i, j)$ pairs, it follows that with probability at least
49/50, for each interval $I'_j \in (\mathcal{I}'')_i$ one of the two candidate hypothesis distributions is $(2^{i/3}\epsilon)$-accurate. We
condition on this event.

Consider Step 5. For a fixed interval $I'_j \in (\mathcal{I}'')_i$, Theorem 4 implies that the algorithm Choose-Hypothesis
will output a hypothesis that is $6 \cdot (2^{i/3}\epsilon)$-close to $p_{I'_j}$ with probability $1 - \epsilon/(500k)$. By a union bound, it follows
that with probability at least 49/50, the above condition holds for all monotone light intervals under
consideration. Therefore, except with failure probability at most 1/20, the statement of the Claim holds. ■

Assuming the claim, (1) follows by exploiting the fact that for intervals $I'_j$ such that $p(I'_j)$ is small we can
afford worse error on the variation distance. More precisely, let $w_i = |(\mathcal{I}'')_i|$, the number of intervals in $(\mathcal{I}'')_i$,
and note that $\sum_{i=1}^{b} w_i \le \ell$. Hence, we can bound the LHS of (1) from above by

$$\sum_{i=1}^{b} w_i \cdot (\epsilon/5k) \cdot 2^{-i+1} \cdot O(2^{i/3} \cdot \epsilon) \le O(1) \cdot (2\epsilon^2/5k) \cdot \sum_{i=1}^{b} w_i \cdot 2^{-2i/3}.$$

Since $\sum_{i=1}^{b} w_i \le \ell$, the above expression is maximized for $w_1 = \ell$ and $w_i = 0$, $i > 1$, and the maximum value
is at most $O(1) \cdot (2\epsilon^2/5k) \cdot \ell = O(\epsilon)$. This proves (1).
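
For completeness, the final maximization can be written out explicitly. This is a sketch of the arithmetic only, using the bound $\ell \le 10k/\epsilon$ from the analysis of Step 2; the $O(1)$ factor stands for the constant hidden in Claim 3.

```latex
\sum_{i=1}^{b} w_i \, 2^{-2i/3}
  \;\le\; 2^{-2/3} \sum_{i=1}^{b} w_i
  \;\le\; 2^{-2/3}\,\ell,
\qquad
O(1) \cdot \frac{2\epsilon^{2}}{5k} \cdot \ell
  \;\le\; O(1) \cdot \frac{2\epsilon^{2}}{5k} \cdot \frac{10k}{\epsilon}
  \;=\; O(1) \cdot 4\epsilon
  \;=\; O(\epsilon)
```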

It is clear that the algorithm has the claimed sample complexity. The running time is also easy to analyze, 
as it is easy to see that every step can be performed in polynomial time (in fact, nearly linear time) in the sample 
size. This completes the proof of Theorem 5. ■

To get an $O(\epsilon)$-accurate hypothesis with probability $1 - \delta$, we can simply run Learn-kmodal-simple
$O(\log(1/\delta))$ times and then perform a tournament using Theorem 4. This increases the sample complexity by
a $\tilde{O}(\log(1/\delta))$ factor. The running time increases by a factor of $O(\log^2(1/\delta))$. We postpone the details for
Appendix C.

3.2 Main Result: Learning $k$-modal distributions using testing

Here is some intuition to motivate our $k$-modal distribution learning algorithm and give a high-level idea of why
the dominant term in its sample complexity is $O(k \log(n/k)/\epsilon^3)$.

Let $p$ denote the target $k$-modal distribution to be learned. As discussed above, optimal (in terms of time
and sample complexity) algorithms are known for learning a monotone distribution over $[n]$, so if the locations
of the $k$ modes of $p$ were known then it would be straightforward to learn $p$ very efficiently by running the
monotone distribution learner over $k + 1$ separate intervals. But it is clear that in general we cannot hope to
efficiently identify the modes of $p$ exactly (for instance it could be the case that $p(a) = p(a + 2) = 1/n$ while
$p(a + 1) = 1/n + 1/2^n$). Still, it is natural to try to decompose the $k$-modal distribution into a collection of
(nearly) monotone distributions and learn those. At a high level that is what our algorithm does, using a novel
property testing algorithm.

More precisely, we give a distribution testing algorithm with the following performance guarantee: Let $q$
be a $k$-modal distribution over $[n]$. Given an accuracy parameter $\tau$, our tester takes $\mathrm{poly}(k/\tau)$ samples from $q$
and outputs "yes" with high probability if $q$ is monotone and "no" with high probability if $q$ is $\tau$-far from every
monotone distribution. (We stress that the assumption that $q$ is $k$-modal is essential here, since an easy argument
given in [BKR04] shows that $\Omega(n^{1/2})$ samples are required to test whether a general distribution over $[n]$ is
monotone versus $\Omega(1)$-far from monotone.)

With some care, by running the above-described tester $O(k/\epsilon)$ times with accuracy parameter $\tau$, we can
decompose the domain $[n]$ into

• at most $k + 1$ "superintervals," which have the property that the conditional distribution of $p$ over each
superinterval is almost monotone ($\tau$-close to monotone);

• at most $k + 1$ "negligible intervals", which have the property that each one has probability mass at most
$O(\epsilon/k)$ under $p$ (so ignoring all of them incurs at most $O(\epsilon)$ total error); and

• at most $k + 1$ "heavy" points, which each have mass at least $\Omega(\epsilon/k)$ under $p$.

We can ignore the negligible intervals, and the heavy points are easy to handle; however some care must be
taken to learn the "almost monotone" restrictions of $p$ over each superinterval. A naive approach, using a generic
$\log(n)/\epsilon^3$-sample monotone distribution learner that has no performance guarantees if the target distribution is
not monotone, leads to an inefficient overall algorithm. Such an approach would require that $\tau$ (the closeness
parameter used by the tester) be at most 1/(the sample complexity of the monotone distribution learner), i.e.
$\tau \le \epsilon^3/\log(n)$. Since the sample complexity of the tester is $\mathrm{poly}(k/\tau)$ and the tester is run $k/\epsilon$ times, this
approach would lead to an overall sample complexity that is unacceptably high.

Fortunately, instead of using a generic monotone distribution learner, we can use the semi-agnostic
monotone distribution learner of Birgé (Theorem 3) that can handle deviations from monotonicity far more efficiently
than the above naive approach. Recall that given draws from a distribution $q$ over $[n]$ that is $\tau$-close to
monotone, this algorithm uses $O(\log(n)/\epsilon^3)$ samples and outputs a hypothesis distribution that is $(2\tau + \epsilon)$-close to
$q$. By using this algorithm we can take the accuracy parameter $\tau$ for our tester to be $\Theta(\epsilon)$ and learn the
conditional distribution of $p$ over a given superinterval to accuracy $O(\epsilon)$ using $O(\log(n)/\epsilon^3)$ samples from that
superinterval. Since there are $k + 1$ superintervals overall, a careful analysis shows that $O(k \log(n)/\epsilon^3)$ samples
suffice to handle all the superintervals.

We note that the algorithm also requires an additional additive $\mathrm{poly}(k/\epsilon)$ samples (independent of $n$) besides
this dominant term (for example, to run the tester and to estimate accurate weights with which to combine the
various sub-hypotheses). The overall sample complexity we achieve is stated in Theorem 6 below.

Theorem 6 (Main) The algorithm Learn-kmodal uses $O\left(k \log(n/k)/\epsilon^3 + (k^3/\epsilon^3) \cdot \log(k/\epsilon) \cdot \log\log(k/\epsilon)\right)$
samples, performs $\mathrm{poly}(k, \log n, 1/\epsilon)$ bit operations, and learns any $k$-modal distribution to accuracy $\epsilon$ and
confidence 9/10.

Theorem [U follows from Theorem 6 by running Learn-kmodal $O(\log(1/\delta))$ times and using hypothesis
testing to boost the confidence to $1 - \delta$. We give details in Appendix C.

Algorithm Learn-kmodal makes essential use of an algorithm $T^{\uparrow}$ for testing whether a $k$-modal
distribution over $[n]$ is non-decreasing. Algorithm $T^{\uparrow}(\epsilon, \delta)$ uses $O(\log(1/\delta)) \cdot (k/\epsilon)^2$ samples from a $k$-modal
distribution $p$ over $[n]$, and behaves as follows:

• (Completeness) If $p$ is non-decreasing, then $T^{\uparrow}$ outputs "yes" with probability at least $1 - \delta$;

• (Soundness) If $p$ is $\epsilon$-far from non-decreasing, then $T^{\uparrow}$ outputs "yes" with probability at most $\delta$.

Let $T^{\downarrow}$ denote the analogous algorithm for testing whether a $k$-modal distribution over $[n]$ is non-increasing
(we will need both algorithms). The description and proof of correctness for $T^{\uparrow}$ are postponed to the following
subsection (Section 3.4).



3.3 Algorithm Learn-kmodal and its analysis 

Algorithm Learn-kmodal is given below with its analysis following. 



Learn-kmodal

Inputs: $\epsilon > 0$; sample access to $k$-modal distribution $p$ over $[n]$

1. Fix $\tau := \epsilon/(100k)$. Draw $r = \Theta(1/\tau^2)$ samples from $p$ and let $\hat{p}$ denote the empirical distribution.

2. Greedily partition the domain $[n]$ into $\ell$ atomic intervals $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$ as follows: $I_1 := [1, j_1]$,
where $j_1 := \min\{j \in [n] \mid \hat{p}([1, j]) \ge \epsilon/(10k)\}$. For $i \ge 1$, if $\cup_{j=1}^{i} I_j = [1, j_i]$, then
$I_{i+1} := [j_i + 1, j_{i+1}]$, where $j_{i+1}$ is defined as follows: if $\hat{p}([j_i + 1, n]) \ge \epsilon/(10k)$, then
$j_{i+1} := \min\{j \in [n] \mid \hat{p}([j_i + 1, j]) \ge \epsilon/(10k)\}$; otherwise, $j_{i+1} := n$.

3. Set $\tau' := \epsilon/(2000k)$. Draw $r' = \Theta((k^3/\epsilon^3) \cdot \log(1/\tau') \log\log(1/\tau'))$ samples $s$ from $p$ to use in
Steps 4-5.

4. Run both $T^{\uparrow}(\epsilon, \tau')$ and $T^{\downarrow}(\epsilon, \tau')$ over $p_{\cup_{i=1}^{j} I_i}$ for $j = 1, 2, \ldots$, to find the leftmost atomic interval
$I_{j_1}$ such that both $T^{\uparrow}$ and $T^{\downarrow}$ return "no" over $p_{\cup_{i=1}^{j_1} I_i}$.

Let $I_{j_1} = [a_{j_1}, b_{j_1}]$. We consider two cases:

Case 1: If $\hat{p}([a_{j_1}, b_{j_1}]) \ge 2\epsilon/(10k)$, define $I'_{j_1} := [a_{j_1}, b_{j_1} - 1]$ and call $b_{j_1}$ a heavy point.

Case 2: If $\hat{p}([a_{j_1}, b_{j_1}]) < 2\epsilon/(10k)$, then define $I'_{j_1} := I_{j_1}$.

Call $I'_{j_1}$ a negligible interval. If $j_1 > 1$ then define the first superinterval $S_1$ to be $\cup_{i=1}^{j_1 - 1} I_i$, and set
$a_1 \in \{\uparrow, \downarrow\}$ to be $a_1 = \uparrow$ if $T^{\uparrow}$ returned "yes" on $p_{\cup_{i=1}^{j_1 - 1} I_i}$ and to be $a_1 = \downarrow$ if $T^{\downarrow}$ returned "yes" on
$p_{\cup_{i=1}^{j_1 - 1} I_i}$.

5. Repeat Step 4 starting with the next interval $I_{j_1 + 1}$, i.e. find the leftmost atomic interval $I_{j_2}$ such
that both $T^{\uparrow}$ and $T^{\downarrow}$ return "no" over $p_{\cup_{i=j_1+1}^{j_2} I_i}$. Continue doing this until all intervals through $I_\ell$
have been used.

Let $S_1, \ldots, S_t$ be the superintervals obtained through the above process and $(a_1, \ldots, a_t) \in \{\uparrow, \downarrow\}^t$
be the corresponding string of bits.

6. Draw $m = \Theta(k \cdot \log(n/k)/\epsilon^3)$ samples $s'$ from $p$. For each superinterval $S_i$, $i \in [t]$, run $A^{a_i}$ (the
semi-agnostic monotone learner of Theorem 3 with orientation $a_i$) on the conditional distribution $p_{S_i}$ of $p$
using the samples in $s' \cap S_i$. Let $\tilde{p}_{S_i}$ be the hypothesis thus obtained.

7. Output the hypothesis $h = \sum_{i=1}^{t} \hat{p}(S_i) \cdot \tilde{p}_{S_i} + \sum_{j} \hat{p}(\{b_j\}) \cdot \mathbf{1}_{b_j}$.
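
As an illustration only, the greedy partition of Step 2 can be sketched in Python as follows. The function name `atomic_intervals` and its argument layout are assumptions made for this sketch and do not come from the paper. The sketch assumes $k \ge 1$ and compares integer sample counts against the threshold $\epsilon/(10k)$ so that no floating-point error accumulates.

```python
from collections import Counter

def atomic_intervals(samples, n, k, eps):
    """Greedy partition of {1, ..., n} into atomic intervals (hypothetical helper).

    An interval is closed at the first point where its empirical mass
    reaches eps / (10 * k); any leftover tail becomes the last interval.
    """
    counts = Counter(samples)
    total = len(samples)
    intervals = []
    left = 1      # left endpoint of the interval being built
    running = 0   # number of samples seen so far in the current interval
    # Empirical mass only grows at sample points, so the minimal right
    # endpoint of each interval is always a sample point.
    for point in sorted(counts):
        running += counts[point]
        if running * 10 * k >= eps * total:   # empirical mass >= eps / (10 k)
            intervals.append((left, point))
            left = point + 1
            running = 0
    if left <= n:   # the remaining mass is below the threshold
        intervals.append((left, n))
    return intervals
```

For instance, with 20 samples, $k = 1$ and $\epsilon = 1$, an interval closes as soon as it holds at least two samples.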



We are now ready to prove Theorem 6.

Proof: [of Theorem 6] Before entering into the proof we record two observations; we state them explicitly here
for the sake of the exposition.

Fact 4 Let $R \subseteq [n]$. If $p_R$ is neither non-increasing nor non-decreasing, then $R$ contains at least one left extreme
point.

Fact 5 Suppose that $R \subseteq [n]$ does not contain a left extreme point. For any $\epsilon, \tau$, if $T^{\uparrow}(\epsilon, \tau)$ and $T^{\downarrow}(\epsilon, \tau)$ are
both run on $p_R$, then the probability that both calls return "no" is at most $\tau$.

Proof: By Fact 4, $p_R$ is either non-decreasing or non-increasing. If $p_R$ is non-decreasing then $T^{\uparrow}$ will output
"no" with probability at most $\tau$, and similarly, if $p_R$ is non-increasing then $T^{\downarrow}$ will output "no" with probability
at most $\tau$. ■

Since $r = \Theta(1/\tau^2)$ samples are drawn in the first step, the DKW inequality implies that with probability of
failure at most 1/100 each interval $I \subseteq [n]$ has $|\hat{p}(I) - p(I)| \le 2\tau$. For the rest of the proof we condition on
this good event.
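
To get a concrete sense of the constant hidden in $r = \Theta(1/\tau^2)$, the sketch below uses the standard form of the DKW inequality, $\Pr[\sup_x |\hat{F}(x) - F(x)| > \tau] \le 2e^{-2r\tau^2}$, to compute a sufficient number of samples. The mass of an interval is a difference of two CDF values, which is where the factor 2 in the bound $2\tau$ comes from. The helper name `dkw_sample_size` and the example parameters are assumptions of this illustration.

```python
import math

def dkw_sample_size(tau, failure_prob):
    """Smallest r with 2 * exp(-2 * r * tau**2) <= failure_prob."""
    return math.ceil(math.log(2.0 / failure_prob) / (2.0 * tau ** 2))

# Example parameters (illustrative): eps = 0.1 and k = 2 give tau = eps / (100 k).
tau = 0.1 / (100 * 2)
r = dkw_sample_size(tau, 1 / 100)
```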

Since every atomic interval $I \in \mathcal{I}$ has $\hat{p}(I) \ge \epsilon/(10k)$ (except potentially the rightmost one), it follows
that the number $\ell$ of atomic intervals constructed in Step 2 satisfies $\ell \le 10 \cdot (k/\epsilon)$. Moreover, by the DKW
inequality, each atomic interval $I_i$ has $p(I_i) \ge 8\epsilon/(100k)$.

Note that in Case 1 of Step 4, if $\hat{p}([a_{j_1}, b_{j_1}]) \ge 2\epsilon/(10k)$ then it must be the case that $\hat{p}(b_{j_1}) \ge \epsilon/(10k)$ (and
thus $p(b_{j_1}) \ge 8\epsilon/(100k)$). In this case, by definition of how the interval $I_{j_1}$ was formed, we must have that
$I'_{j_1} = [a_{j_1}, b_{j_1} - 1]$ satisfies $\hat{p}(I'_{j_1}) < \epsilon/(10k)$. So both in Case 1 and Case 2, we now have that $\hat{p}(I'_{j_1}) \le 2\epsilon/(10k)$,
and thus $p(I'_{j_1}) \le 22\epsilon/(100k)$. Entirely similar reasoning shows that every negligible interval constructed in
Steps 4 and 5 has mass at most $22\epsilon/(100k)$ under $p$.

In Steps 4-5 we invoke the testers $T^{\downarrow}$ and $T^{\uparrow}$ on the conditional distributions of (unions of contiguous)
atomic intervals. Note that we need enough samples in every atomic interval, since otherwise the testers provide
no guarantees. We claim that with probability at least 99/100 over the sample $s$ of Step 3, each atomic interval
gets $b = \Omega((k/\epsilon)^2 \cdot \log(1/\tau'))$ samples. This follows by a standard coupon collector's argument, which we now
provide. As argued above, each atomic interval has probability mass $\Omega(\epsilon/k)$ under $p$. So, we have $\ell = O(k/\epsilon)$
bins (atomic intervals), and we want each bin to contain $b$ balls (samples). It is well-known [NS60] that after
taking $\Theta(\ell \cdot \log \ell + \ell \cdot b \cdot \log\log \ell)$ samples from $p$, with probability 99/100 each bin will contain the desired
number of balls. The claim now follows by our choice of parameters. Conditioning on this event, any execution
of the testers $T^{\uparrow}(\epsilon, \tau')$ and $T^{\downarrow}(\epsilon, \tau')$ in Steps 4 and 5 will have the guaranteed completeness and soundness
properties.

In the execution of Steps 4 and 5, there are a total of at most $\ell$ occasions when $T^{\uparrow}(\epsilon, \tau')$ and $T^{\downarrow}(\epsilon, \tau')$ are
both run over some union of contiguous atomic intervals. By Fact 5 and a union bound, the probability that
(in any of these instances the interval does not contain a left extreme point and yet both calls return "no") is at
most $(10k/\epsilon)\tau' \le 1/200$. So with failure probability at most 1/200 for this step, each time Step 4 identifies a
group of consecutive intervals $I_j, \ldots, I_{j+r}$ such that both $T^{\uparrow}$ and $T^{\downarrow}$ output "no", there is a left extreme point
in $\cup_{i=j}^{j+r} I_i$. Since $p$ is $k$-modal, it follows that with failure probability at most 1/200 there are at most $k + 1$ total
repetitions of Step 4, and hence the number $t$ of superintervals obtained is at most $k + 1$.

We moreover claim that with very high probability each of the $t$ superintervals $S_i$ is very close to
non-increasing or non-decreasing (with its correct orientation given by $a_i$):

Claim 6 With failure probability at most 1/100, each $i \in [t]$ satisfies the following: if $a_i = \uparrow$ then $p_{S_i}$ is $\epsilon$-close
to a non-decreasing distribution and if $a_i = \downarrow$ then $p_{S_i}$ is $\epsilon$-close to a non-increasing distribution.

Proof: There are at most $2\ell \le 20k/\epsilon$ instances when either $T^{\downarrow}$ or $T^{\uparrow}$ is run on a union of contiguous intervals.
For any fixed execution of $T^{\downarrow}$ over an interval $I$, the probability that $T^{\downarrow}$ outputs "yes" while $p_I$ is $\epsilon$-far from
every non-increasing distribution over $I$ is at most $\tau'$, and similarly for $T^{\uparrow}$. A union bound and the choice of $\tau'$
conclude the proof of the claim. ■

Thus we have established that with overall failure probability at most 5/100, after Step 5 the interval $[n]$ has
been partitioned into:

1. A set $\{S_i\}_{i=1}^{t}$ of $t \le k + 1$ superintervals, with $p(S_i) \ge 8\epsilon/(100k)$ and $p_{S_i}$ being $\epsilon$-close to either
non-increasing or non-decreasing according to the value of bit $a_i$.

2. A set $\{I'_i\}_{i=1}^{t'}$ of $t' \le k + 1$ negligible intervals, such that $p(I'_i) \le 22\epsilon/(100k)$.

3. A set $\{b_i\}_{i=1}^{t''}$ of $t'' \le k + 1$ heavy points, each with $p(b_i) \ge 8\epsilon/(100k)$.

We condition on the above good events, and bound from above the expected total variation distance (over the
sample $s'$). In particular, we have the following lemma:

Lemma 7 We have that $\mathbf{E}_{s'}[d_{TV}(h, p)] \le O(\epsilon)$.

Proof: (of Lemma 7) By the discussion preceding the lemma statement, the domain $[n]$ has been partitioned into
a set of superintervals, a set of negligible intervals and a set of heavy points. As a consequence, we can write

\[ p = \sum_{j=1}^{t} p(S_j) \cdot p_{S_j} + \sum_{j=1}^{t''} p(\{b_j\}) \cdot \mathbf{1}_{b_j} + \sum_{j=1}^{t'} p(I'_j) \cdot p_{I'_j}. \]

Therefore, we can bound the total variation distance as follows:

\[ d_{TV}(h, p) \le \sum_{j=1}^{t} |\hat{p}(S_j) - p(S_j)| + \sum_{j=1}^{t''} |\hat{p}(b_j) - p(b_j)| + \sum_{j=1}^{t'} p(I'_j) + \sum_{j=1}^{t} p(S_j) \cdot d_{TV}(\tilde{p}_{S_j}, p_{S_j}). \]

Recall that each term in the first two sums is bounded from above by $2\tau$. Hence, the contribution of these terms
to the RHS is at most $2\tau \cdot (2k + 2) \le \epsilon/10$. Since each negligible interval $I'_j$ has $p(I'_j) \le 22\epsilon/(100k)$, the
contribution of the third sum is at most $t' \cdot 22\epsilon/(100k) \le \epsilon/2$ (using $t' \le k + 1 \le 2k$). It thus remains to bound
the contribution of the last sum.

We will show that

\[ \mathbf{E}_{s'}\left[ \sum_{j=1}^{t} p(S_j) \cdot d_{TV}(\tilde{p}_{S_j}, p_{S_j}) \right] \le O(\epsilon). \]



Denote $n_i = |S_i|$. Clearly, $\sum_{i=1}^{t} n_i \le n$. Since we are conditioning on the good events (1)-(3), each
superinterval is $\epsilon$-close to monotone with a known orientation (non-increasing or non-decreasing) given by $a_i$.
Hence we may apply Theorem 3 for each superinterval.

Recall that in Step 6 we draw a total of $m$ samples. Let $m_i$, $i \in [t]$, be the number of samples that land in $S_i$;
observe that $m_i$ is a binomially distributed random variable with $m_i \sim \mathrm{Bin}(m, p(S_i))$. We apply Theorem 3 for
each $\epsilon$-monotone interval, conditioning on the value of $m_i$, and get

\[ d_{TV}(\tilde{p}_{S_i}, p_{S_i}) \le 2\epsilon + O\left( (\log n_i/(m_i + 1))^{1/3} \right). \]

Hence, we can bound from above the desired expectation as follows:

\[ \sum_{j=1}^{t} p(S_j) \cdot \mathbf{E}_{s'}\left[ d_{TV}(\tilde{p}_{S_j}, p_{S_j}) \right] \le \left( \sum_{j=1}^{t} 2\epsilon \cdot p(S_j) \right) + O\left( \sum_{j=1}^{t} p(S_j) \cdot (\log n_j)^{1/3} \cdot \mathbf{E}_{s'}[(m_j + 1)^{-1/3}] \right). \]

Since $\sum_j p(S_j) \le 1$, to prove the lemma, it suffices to show that the second term is bounded, i.e. that

\[ \sum_{j=1}^{t} p(S_j) \cdot (\log n_j)^{1/3} \cdot \mathbf{E}_{s'}[(m_j + 1)^{-1/3}] = O(\epsilon). \]

To do this, we will first need the following claim:

Claim 8 For a binomial random variable $X \sim \mathrm{Bin}(m, q)$ it holds $\mathbf{E}[(X + 1)^{-1/3}] < (mq)^{-1/3}$.

Proof: Jensen's inequality implies that $\mathbf{E}[(X + 1)^{-1/3}] \le (\mathbf{E}[1/(X + 1)])^{1/3}$. We claim that $\mathbf{E}[1/(X + 1)] <
1/\mathbf{E}[X]$. This can be shown as follows: We first recall that $\mathbf{E}[X] = m \cdot q$. For the expectation of the inverse, we
can write:

\[ \mathbf{E}\left[\frac{1}{X + 1}\right] = \sum_{j=0}^{m} \frac{1}{j + 1} \binom{m}{j} q^{j} (1 - q)^{m - j} = \frac{1}{q \cdot (m + 1)} \sum_{i=1}^{m+1} \binom{m + 1}{i} q^{i} (1 - q)^{m + 1 - i} = \frac{1 - (1 - q)^{m+1}}{q \cdot (m + 1)} < \frac{1}{m \cdot q}. \]

The claim now follows by the monotonicity of the mapping $x \mapsto x^{1/3}$. ■
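
As a sanity check on Claim 8 (not a substitute for the proof), the short Python sketch below evaluates both expectations exactly from the binomial probability mass function for small parameters. The function names are assumptions of this illustration.

```python
from math import comb

def binomial_pmf(m, q, j):
    """Probability that Bin(m, q) equals j."""
    return comb(m, j) * q ** j * (1 - q) ** (m - j)

def expected_inverse(m, q):
    """Exact E[1 / (X + 1)] for X ~ Bin(m, q)."""
    return sum(binomial_pmf(m, q, j) / (j + 1) for j in range(m + 1))

def expected_inverse_cube_root(m, q):
    """Exact E[(X + 1)^(-1/3)] for X ~ Bin(m, q)."""
    return sum(binomial_pmf(m, q, j) * (j + 1) ** (-1 / 3) for j in range(m + 1))

def claim8_bound(m, q):
    """Right-hand side of Claim 8, (m q)^(-1/3)."""
    return (m * q) ** (-1 / 3)
```

For moderate $m$ one can compare `expected_inverse(m, q)` with the closed form $(1 - (1 - q)^{m+1})/(q(m+1))$ and `expected_inverse_cube_root(m, q)` with `claim8_bound(m, q)`.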

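By Claim 8 applied to $m_j \sim \mathrm{Bin}(m, p(S_j))$, we have that $\mathbf{E}_{s'}[(m_j + 1)^{-1/3}] < m^{-1/3} \cdot (p(S_j))^{-1/3}$.
Therefore, our desired quantity can be bounded from above by

\[ \sum_{j=1}^{t} \frac{p(S_j) \cdot (\log n_j)^{1/3}}{m^{1/3} \cdot (p(S_j))^{1/3}} = O(\epsilon) \cdot \sum_{j=1}^{t} (p(S_j))^{2/3} \cdot \left( \frac{\log n_j}{k \cdot \log(n/k)} \right)^{1/3}. \]

We now claim that the second term in the RHS above is upper bounded by 2. Indeed, this follows by an
application of Hölder's inequality for the vectors $(p(S_j)^{2/3})_{j=1}^{t}$ and $((\log n_j/(k \cdot \log(n/k)))^{1/3})_{j=1}^{t}$, with Hölder conjugates
3/2 and 3. That is,

\[ \sum_{j=1}^{t} (p(S_j))^{2/3} \cdot \left( \frac{\log n_j}{k \cdot \log(n/k)} \right)^{1/3} \le \left( \sum_{j=1}^{t} p(S_j) \right)^{2/3} \cdot \left( \sum_{j=1}^{t} \frac{\log n_j}{k \cdot \log(n/k)} \right)^{1/3} \le 2. \]

The first inequality is Hölder and the second uses the fact that $\sum_{j=1}^{t} p(S_j) \le 1$ and $\sum_{j=1}^{t} \log(n_j) \le t \cdot
\log(n/t) \le (k + 1) \cdot \log(n/k)$. This last inequality is a consequence of the concavity of the logarithm and the
fact that $\sum_j n_j \le n$. This completes the proof of the Lemma. ■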

By applying Markov's inequality and a union bound, we get that with probability 9/10 the algorithm
Learn-kmodal outputs a hypothesis $h$ that has $d_{TV}(h, p) \le O(\epsilon)$ as required.

It is clear that the algorithm has the claimed sample complexity. The running time is also easy to analyze,
as every step can be performed in polynomial time (in fact, nearly linear time) in the sample
size. This completes the proof of Theorem 6. ■

3.4 Testing whether a $k$-modal distribution is monotone

In this section we describe and analyze the testing algorithm $T^{\uparrow}$. Given sample access to a $k$-modal distribution
$q$ over $[n]$ and $\tau > 0$, our tester $T^{\uparrow}$ uses $O(k^2/\tau^2)$ many samples from $q$ and has the following properties:

• If $q$ is non-decreasing, $T^{\uparrow}$ outputs "yes" with probability at least 2/3.

• If $q$ is $\tau$-far from non-decreasing, $T^{\uparrow}$ outputs "no" with probability at least 2/3.

(The algorithm $T^{\uparrow}(\tau, \delta)$ is obtained by repeating $T^{\uparrow}$ $O(\log(1/\delta))$ times and taking the majority vote.)
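
The majority-vote amplification mentioned above can be sketched as follows. This is a generic illustration: `amplify` and `base_tester` are hypothetical names, and `base_tester` is assumed to be a zero-argument callable that draws fresh samples on each call and returns `True` for "yes".

```python
def amplify(base_tester, repetitions):
    """Wrap a tester that is correct with probability 2/3 in a majority vote.

    Taking repetitions = O(log(1 / delta)), preferably odd, drives the
    error probability below delta by a standard Chernoff bound.
    """
    def amplified():
        yes_votes = sum(1 for _ in range(repetitions) if base_tester())
        return 2 * yes_votes > repetitions   # strict majority says "yes"
    return amplified
```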



Tester $T^{\uparrow}(\tau)$

Inputs: $\tau > 0$; sample access to $k$-modal distribution $q$ over $[n]$

1. Fix $\delta := \tau/(100k)$. Draw $r = \Theta(1/\delta^2)$ samples $s$ from $q$ and let $\hat{q}$ be the resulting empirical
distribution.

2. If there exist $a \le b < c \in s \cup \{1, n\}$ such that

\[ \hat{E}(a, b, c) := \frac{\hat{q}([a, b])}{(b - a + 1)} - \frac{\hat{q}([b + 1, c])}{(c - b)} \ge \frac{(\tau/4k)}{(b - a + 1)} + \frac{(\tau/4k)}{(c - b)} \qquad (2) \]

then output "no", otherwise output "yes".









The idea behind tester $T^{\uparrow}$ is simple. It is based on the observation that if $q$ is a non-decreasing distribution,
then for any two consecutive intervals $[a, b]$ and $[b + 1, c]$ the average of $q$ over $[b + 1, c]$ must be at least as
large as the average of $q$ over $[a, b]$. Thus any non-decreasing distribution will pass a test that checks "all" pairs
of consecutive intervals looking for a violation. Our analysis shows that in fact such a test is complete as well
as sound if the distribution $q$ is guaranteed to be $k$-modal. The key ingredient is the structural Lemma 9 below,
which is proved using a procedure (reminiscent of Myerson ironing [Mye81]) to convert a $k$-modal distribution
to a non-decreasing distribution.

The following theorem establishes correctness of the tester.

Theorem 7 The algorithm $T^{\uparrow}$ uses $O(k^2/\tau^2)$ samples from $q$, performs $\mathrm{poly}(k/\tau) \cdot \log n$ bit operations and
satisfies the desired completeness and soundness properties.

Proof: The upper bound on the sample complexity is straightforward, since only Step 1 uses samples. It is also
easy to see that a straightforward implementation of the algorithm runs in time $\mathrm{poly}(k/\tau) \cdot \log n$. Below we
prove that the algorithm has the claimed soundness and completeness properties.

Let us say that the sample $s$ is good if every interval $I \subseteq [n]$ has $|\hat{q}(I) - q(I)| \le 2\delta$. By the DKW inequality,
with probability at least 2/3 the sample $s$ is good. Assuming that $s$ is good, we have that for any $a \le b < c \in [n]$
the quantity

\[ E(a, b, c) := \frac{q([a, b])}{(b - a + 1)} - \frac{q([b + 1, c])}{(c - b)} \]

differs from its empirical value $\hat{E}(a, b, c)$ (i.e. the LHS of (2)) by at most $\gamma(a, b, c) := \frac{2\delta}{(b - a + 1)} + \frac{2\delta}{(c - b)}$. That is,

\[ |\hat{E}(a, b, c) - E(a, b, c)| \le \gamma(a, b, c). \qquad (3) \]

We first show completeness. If $q$ is non-decreasing the average probability value in any interval $[a, b]$ is
a non-decreasing function of $a$. That is, for all $a \le b < c \in [n]$ it holds $E(a, b, c) \le 0$. Therefore, with
probability at least 2/3, it holds $\hat{E}(a, b, c) \le \gamma(a, b, c)$. Since $2\delta = \tau/(50k) < \tau/(4k)$, no triple reaches the
threshold in (2), and the tester says "yes".

For soundness, we need the following lemma:

Lemma 9 Let $q$ be a $k$-modal distribution over $[n]$ that is $\tau$-far from being non-decreasing. Then there exists a
triple of points $a \le b < c \in [n]$ such that

\[ E(a, b, c) \ge \frac{(\tau/2k)}{(b - a + 1)} + \frac{(\tau/2k)}{(c - b)}. \qquad (4) \]

We first show how the soundness follows from the lemma. For $q$ a $k$-modal distribution that is $\tau$-far from
non-decreasing, we will argue that if the sample is good then there exists a triple $s_a \le s_b < s_c \in s \cup \{1, n\}$
such that $\hat{E}(s_a, s_b, s_c)$ satisfies (2).

By Lemma 9, there exists a triple $a \le b < c \in [n]$ satisfying (4).

We first note that at least one sample must have landed in $[a, b]$, for otherwise the DKW inequality would
give that $q([a, b]) \le 2\delta$; this in turn would imply that $E(a, b, c) \le 2\delta/(b - a + 1)$, a contradiction, as it violates
(4). We now define the points $s_a, s_b, s_c$ as follows: (i) $s_a$ is the leftmost point of the sample in $[a, b]$; (ii) $s_b$ is the
rightmost point of the sample in $[a, b]$; and (iii) $s_c$ is either the leftmost point of the sample in $[c + 1, n]$, or the
rightmost point $n$ of the interval, if $\hat{q}([c + 1, n]) = 0$. We will now argue that these points satisfy (2). Consider
the interval $[s_a, s_b]$. Then, we have that

\[ \frac{\hat{q}([s_a, s_b])}{s_b - s_a + 1} \ge \frac{\hat{q}([s_a, s_b])}{b - a + 1} = \frac{\hat{q}([a, b])}{b - a + 1} \ge \frac{q([a, b])}{b - a + 1} - \frac{2\delta}{b - a + 1}, \qquad (5) \]

where the first inequality uses the fact that $[s_a, s_b] \subseteq [a, b]$, the equality uses the definition of $s_a$ and $s_b$, and the
final inequality follows by an application of the DKW inequality for the interval $[a, b]$. An analogous argument
can be applied for the interval $[s_b + 1, s_c]$. Indeed, we have that

\[ \frac{\hat{q}([s_b + 1, s_c])}{s_c - s_b} \le \frac{\hat{q}([s_b + 1, s_c])}{c - b} = \frac{\hat{q}([b + 1, c])}{c - b} \le \frac{q([b + 1, c])}{c - b} + \frac{2\delta}{c - b}, \qquad (6) \]

where the first inequality follows from the fact that $[s_b + 1, s_c] \supseteq [b + 1, c]$, the equality uses the definition of $s_b$ and
$s_c$, and the final inequality follows by an application of the DKW inequality for the interval $[b + 1, c]$.

A combination of (4), (5), (6) yields the desired result. It thus remains to prove Lemma 9.
Proof: [Lemma 9] We prove the contrapositive. Let $q$ be a $k$-modal distribution such that for all $a \le b < c \in [n]$

\[ E(a, b, c) \le \frac{(\tau/2k)}{(b - a + 1)} + \frac{(\tau/2k)}{(c - b)}. \qquad (7) \]

We will show that $q$ is $\tau$-close to being non-decreasing by constructing a non-decreasing distribution $\tilde{q}$ that is
$\tau$-close to $q$. The construction of $\tilde{q}$ proceeds in $k$ stages where in each stage, we reduce the number of modes
by at least one and incur error in variation distance at most $\tau/k$. That is, we iteratively construct a sequence of
distributions $\{q^{(i)}\}_{i=0}^{k}$, $q^{(0)} = q$ and $q^{(k)} = \tilde{q}$, such that for all $i \in [k]$ we have that $q^{(i)}$ is $(k - i)$-modal and
$d_{TV}(q^{(i-1)}, q^{(i)}) \le \tau/k$.
Consider the graph (histogram) of the discrete density $q$. The $x$-axis represents the $n$ points of the domain
and the $y$-axis the corresponding probabilities. We first informally describe how to obtain $q^{(1)}$ from $q$. The
construction of $q^{(i+1)}$ from $q^{(i)}$ is identical. Let $j$ be the leftmost left-extreme point (mode) of $q$, and assume
that it is a local maximum with height (probability mass) $q(j)$. (A symmetric argument works for the case that it
is a local minimum.) The idea of the proof is based on the following simple process (reminiscent of Myerson's
ironing process [Mye81]): We start with the horizontal line $y = q(j)$ and move it downwards until we reach
a height $h_0 < q(j)$ so that the total mass "cut-off" equals the mass "missing" to the right; then make the
distribution "flat" in the corresponding interval (hence, reducing the number of modes by at least one). The
resulting distribution is $q^{(1)}$ and equation (7) implies that $d_{TV}(q^{(1)}, q) \le \tau/k$.

We now proceed with the formal argument, assuming as above that the leftmost left-extreme point $j$ of $q$ is
a local maximum. We say that the line $y = h$ intersects a point $i \in [n]$ in the domain of $q$ if $q(i) \ge h$. The
line $y = h$, $h \in [0, q(j)]$, intersects the graph of $q$ at a unique interval $I(h) \subseteq [n]$ that contains $j$. Suppose
$I(h) = [a(h), b(h)]$, where $a(h), b(h) \in [n]$ depend on $h$. By definition this means that $q(a(h)) \ge h$ and
$q(a(h) - 1) < h$. Recall that the distribution $q$ is non-decreasing in the interval $[1, j]$ and that $j \ge a(h)$. The
term "the mass cut-off by the line $y = h$" means the quantity $A(h) = q(I(h)) - h \cdot (b(h) - a(h) + 1)$, i.e. the
"mass of the interval $I(h)$ above the line".

The height $h$ of the line $y = h$ defines the points $a(h), b(h) \in [n]$ as described above. We consider values
of $h$ such that $q$ is unimodal (increasing then decreasing) over $I(h)$. In particular, let $j'$ be the leftmost mode
of $q$ to the right of $j$, i.e. $j' > j$ and $j'$ is a local minimum. We consider values of $h \in (q(j'), q(j))$. For such
values, the interval $I(h)$ is indeed unimodal (as $b(h) < j'$). For $h \in (q(j'), q(j))$ we define the point $c(h) \ge j'$
as follows: It is the rightmost point of the largest interval containing $j'$ whose probability mass does not exceed
$h$. That is, all points in $[j', c(h)]$ have probability mass at most $h$ and $q(c(h) + 1) > h$ (or $c(h) = n$).

Consider the interval $J(h) = [b(h) + 1, c(h)]$. This interval is non-empty, since $b(h) < j' \le c(h)$. (Note that
$J(h)$ is not necessarily a unimodal interval; it contains at least one mode $j'$, but it may also contain more modes.)
The term "the mass missing to the right of the line $y = h$" means the quantity $B(h) = h \cdot (c(h) - b(h)) - q(J(h))$.

Consider the function $C(h) = A(h) - B(h)$ over $[q(j'), q(j)]$. This function is continuous in its domain;
moreover, we have that $C(q(j)) = A(q(j)) - B(q(j)) < 0$, as $A(q(j)) = 0$, and $C(q(j')) = A(q(j')) -
B(q(j')) > 0$, as $B(q(j')) = 0$. Therefore, by the intermediate value theorem, there exists a value $h_0 \in
(q(j'), q(j))$ such that $A(h_0) = B(h_0)$.

The distribution $q^{(1)}$ is constructed as follows: We move the mass $\tau' = A(h_0)$ from $I(h_0)$ to $J(h_0)$. Hence,
it follows that $d_{TV}(q^{(1)}, q) \le 2\tau'$. We also claim that $q^{(1)}$ has at least one mode less than $q$. Indeed, $q^{(1)}$ is
non-decreasing in $[1, a(h_0) - 1]$ and constant in $[a(h_0), c(h_0)]$. (All the points in the latter interval have probability
mass exactly $h_0$.) Recalling that $q^{(1)}(a(h_0)) = h_0 \ge q^{(1)}(a(h_0) - 1) = q(a(h_0) - 1)$, we deduce that $q^{(1)}$ is
non-decreasing in $[1, c(h_0)]$.

We will now argue that $\tau' \le \tau/(2k)$, which completes the proof of the lemma. To this end we use our
starting assumption, equation (7). Recall that we have $A(h_0) = B(h_0) = \tau'$, which can be written as

\[ q([a(h_0), b(h_0)]) - h_0 \cdot (b(h_0) - a(h_0) + 1) = h_0 \cdot (c(h_0) - b(h_0)) - q([b(h_0) + 1, c(h_0)]) = \tau'. \]

From this, we get

\[ \frac{q([a(h_0), b(h_0)])}{(b(h_0) - a(h_0) + 1)} - \frac{q([b(h_0) + 1, c(h_0)])}{(c(h_0) - b(h_0))} = \frac{\tau'}{(b(h_0) - a(h_0) + 1)} + \frac{\tau'}{(c(h_0) - b(h_0))}. \]

Combining with (7) proves Lemma 9. ■

This completes the proof of Theorem 7. ■

4 Conclusions and future work 

At the level of techniques, this work illustrates the viability of a new general strategy for developing efficient
learning algorithms, namely by using "inexpensive" property testers to decompose a complex object (for us
these objects are $k$-modal distributions) into simpler objects (for us these are monotone distributions) that can
be more easily learned. It would be interesting to apply this paradigm in other contexts such as learning Boolean
functions.

At the level of the specific problem we consider (learning $k$-modal distributions), our results show that
$k$-modality is a useful type of structure which can be strongly exploited by sample-efficient and computationally
efficient learning algorithms. Our results motivate the study of computationally efficient learning algorithms
for distributions that satisfy other kinds of "shape restrictions." Possible directions here include multivariate
$k$-modal distributions, log-concave distributions, monotone hazard rate distributions and more.

Finally, at a technical level, any improvement in the sample complexity of our property testing algorithm of
Section 3.4 would directly improve the "extraneous" additive $\tilde{O}((k/\epsilon)^3)$ term in the sample complexity of our
algorithm. We suspect that it may be possible to improve our testing algorithm (although we note that it is easy
to give an $\Omega(\sqrt{k})$ lower bound using standard constructions).

A Birgé's algorithm as a semi-agnostic learner

In this section we briefly explain why Birgé's algorithm [Bir87b] also works in the semi-agnostic setting. To do
this, we need to explain his approach. For this, we will need the following theorem, which gives a tight bound
on the number of samples required to learn an arbitrary distribution with respect to total variation distance.

Theorem 8 (Folklore) Let $p$ be any distribution over $[n]$. We have: $\mathbf{E}[d_{TV}(p, \hat{p}_m)] \le 2\sqrt{n/m}$.

Let $p$ be a non-increasing distribution over $[n]$. (The analysis for the non-decreasing case is identical.)
Conceptually, we view algorithm $L^{\downarrow}$ as working in three steps:

• In the first step, it partitions the set $[n]$ into a carefully chosen set $I_1, \ldots, I_\ell$ of consecutive intervals,
with $\ell = O(m^{1/3} \cdot (\log n)^{2/3})$. Consider the flattened distribution $p_f$ over $[n]$ obtained from $p$ by
averaging the weight that $p$ assigns to each interval over the entire interval. That is, for $j \in [\ell]$ and
$i \in I_j$, $p_f(i) = \sum_{t \in I_j} p(t)/|I_j|$. Then a simple argument given in [Bir87b] gives that $d_{TV}(p_f, p) =
O((\log n/(m + 1))^{1/3})$.

• Let $p_r$ be the reduced distribution corresponding to $p$ and the partition $I_1, \ldots, I_\ell$. That is, $p_r$ is a
distribution over $[\ell]$ with $p_r(i) = p(I_i)$ for $i \in [\ell]$. In the second step, the algorithm uses the $m$ samples to learn
$p_r$. (Note that $p_r$ is not necessarily monotone.) After $m$ samples, one obtains a hypothesis $\hat{p}_r$ such that
$\mathbf{E}[d_{TV}(p_r, \hat{p}_r)] = O(\sqrt{\ell/m}) = O((\log n/(m + 1))^{1/3})$. The first equality follows from Theorem 8
(since $p_r$ is a distribution over $\ell$ elements) and the second equality follows from the choice of $\ell$.

• Finally, the algorithm outputs the flattened hypothesis $(\hat{p}_r)_f$ over $[n]$ corresponding to $\hat{p}_r$, i.e. obtained from
$\hat{p}_r$ by subdividing the mass of each interval uniformly within the interval. It follows from the above two
steps that $\mathbf{E}[d_{TV}((\hat{p}_r)_f, p_f)] = O((\log n/(m + 1))^{1/3})$.

• The combination of the first and third steps yields that $\mathbf{E}[d_{TV}((\hat{p}_r)_f, p)] = O((\log n/(m + 1))^{1/3})$.
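
The second and third steps above can be sketched in Python as follows. The partition is taken as an input, since the specific choice of intervals in Birgé's construction is not spelled out here; the function name and the list-based representation of the hypothesis are assumptions of this sketch. The loop over every domain point is for readability only and is not meant to reflect the efficiency of the actual algorithm.

```python
from collections import Counter

def flattened_hypothesis(samples, intervals, n):
    """Learn the reduced distribution empirically, then flatten it over [n].

    intervals is a list of (lo, hi) pairs partitioning {1, ..., n};
    the returned list holds the hypothesis mass of point i at index i - 1.
    """
    m = len(samples)
    counts = Counter(samples)
    hypothesis = [0.0] * n
    for lo, hi in intervals:
        # Empirical mass of the interval: one coordinate of the reduced distribution
        mass = sum(counts[x] for x in range(lo, hi + 1)) / m
        # Spread that mass uniformly over the points of the interval
        for i in range(lo, hi + 1):
            hypothesis[i - 1] = mass / (hi - lo + 1)
    return hypothesis
```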

The above arguments are entirely due to Birgé [Bir87b]. We now explain how his analysis can be extended
to show that his algorithm is in fact a semi-agnostic learner as claimed in Theorem 3. To avoid clutter in the
expressions below let us fix $\delta := O((\log n/(m + 1))^{1/3})$.

The second and third steps in the algorithm description above are used to learn the distribution $p_f$ to variation
distance $\delta$. Note that these steps do not use the assumption that $p$ is non-increasing. The following claim, which
generalizes Step 1 above, says that if $p$ is $\tau$-close to non-increasing, the flattened distribution $p_f$ (defined as
above) is $(2\tau + \delta)$-close to $p$. Therefore, it follows that, for such a distribution $p$, algorithm $L^{\downarrow}$ succeeds with
expected (total variation distance) error $(2\tau + \delta) + \delta$.

We have: 

Claim 10 Let $p$ be a distribution over $[n]$ that is $\tau$-close to non-increasing. Then, the flattened distribution $p_f$
(obtained from $p$ by averaging its weight on every interval $I_j$) satisfies $d_{TV}(p_f, p) \le (2\tau + \delta)$.

Proof: Let $p^{\downarrow}$ be the non-increasing distribution that is $\tau$-close to $p$. Let $\tau_j$ denote the $L_1$-distance between $p$
and $p^{\downarrow}$ in the interval $I_j$. Then, we have that

\[ \sum_{j=1}^{\ell} \tau_j \le \tau. \qquad (8) \]

By Birgé's arguments, it follows that the flattened distribution $(p^{\downarrow})_f$ corresponding to $p^{\downarrow}$ is $\delta$-close to $p^{\downarrow}$,
hence $(\tau + \delta)$-close to $p$. That is,

\[ d_{TV}((p^{\downarrow})_f, p) \le \tau + \delta. \qquad (9) \]




We want to show that

\[ d_{TV}((p^{\downarrow})_f, p_f) \le \tau. \qquad (10) \]

Assuming (10) holds, we can conclude by the triangle inequality that

\[ d_{TV}(p, p_f) \le 2\tau + \delta \]

as desired.

Observe that, by assumption, $p$ and $p^{\downarrow}$ have $L_1$-distance at most $\tau_j$ in each interval $I_j$. In particular, this
implies that, for all $j \in [\ell]$, it holds

\[ |p(I_j) - p^{\downarrow}(I_j)| \le \tau_j. \]

Now note that, within each interval $I_j$, $p_f$ and $(p^{\downarrow})_f$ are both uniform. Hence, the contribution of $I_j$ to the
variation distance between $p_f$ and $(p^{\downarrow})_f$ is at most $|p(I_j) - p^{\downarrow}(I_j)|$.
Therefore, by (8) we deduce

\[ d_{TV}(p_f, (p^{\downarrow})_f) \le \tau \]

which completes the proof of the claim. ■
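
The key step (10), that flattening two distributions over the same partition cannot increase their distance, can be checked numerically on a toy example. The sketch below is an illustration under assumed names (`flatten`, `total_variation`) and made-up numbers; it uses the convention that total variation distance is half the $L_1$ distance.

```python
def flatten(p, intervals):
    """Average p over each (lo, hi) interval; p[i - 1] is the mass of point i."""
    flat = list(p)
    for lo, hi in intervals:
        avg = sum(p[lo - 1:hi]) / (hi - lo + 1)
        flat[lo - 1:hi] = [avg] * (hi - lo + 1)
    return flat

def total_variation(p, q):
    """Half the L1 distance between two distributions on the same domain."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

intervals = [(1, 2), (3, 6)]
p_down = [0.3, 0.25, 0.15, 0.15, 0.1, 0.05]   # non-increasing
p = [0.25, 0.3, 0.15, 0.1, 0.15, 0.05]        # a perturbation of p_down
before = total_variation(p, p_down)
after = total_variation(flatten(p, intervals), flatten(p_down, intervals))
```

In this example the perturbation only moves mass within intervals, so the two flattened distributions coincide up to rounding while the original distance is positive.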



B Hypothesis Testing 

Our hypothesis testing routine Choose-Hypothesis$^p$ runs a simple "competition" to choose a winner
between two candidate hypothesis distributions $h_1$ and $h_2$ over $[n]$ that it is given in the input either explicitly, or
in some succinct way. We show that if at least one of the two candidate hypotheses is close to the target
distribution $p$, then with high probability over the samples drawn from $p$ the routine selects as winner a candidate that
is close to $p$. This basic approach of running a competition between candidate hypotheses is quite similar to the
"Scheffé estimate" proposed by Devroye and Lugosi (see [DL96b, DL96a] and Chapter 6 of [DL01]), which in
turn built closely on the work of [Yat85], but there are some small differences between our approach and theirs;
the [DL01] approach uses a notion of the "competition" between two hypotheses which is not symmetric under
swapping the two competing hypotheses, whereas our competition is symmetric.
We now prove Theorem 4.

Proof: [of Theorem 4] Let $W$ be the support of $p$. To set up the competition between $h_1$ and $h_2$, we define the
following subset of $W$:

\[ W_1 = W_1(h_1, h_2) := \{ w \in W \mid h_1(w) > h_2(w) \}. \qquad (11) \]

Let then $p_1 = h_1(W_1)$ and $q_1 = h_2(W_1)$. Clearly, $p_1 > q_1$ and $d_{TV}(h_1, h_2) = p_1 - q_1$.
The competition between $h_1$ and $h_2$ is carried out as follows:

1. If $p_1 - q_1 \le 5\epsilon'$, declare a draw and return either $h_i$. Otherwise:

2. Draw $m = O\left(\frac{\log(1/\delta')}{\epsilon'^2}\right)$ samples $s_1, \ldots, s_m$ from $p$, and let $\tau = \frac{1}{m}|\{i \mid s_i \in W_1\}|$ be the fraction of
samples that fall inside $W_1$.

3. If $\tau > p_1 - \frac{3}{2}\epsilon'$, declare $h_1$ as winner and return $h_1$; otherwise,

4. if $\tau < q_1 + \frac{3}{2}\epsilon'$, declare $h_2$ as winner and return $h_2$; otherwise,

5. declare a draw and return either $h_i$.
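
The five steps above can be rendered in Python roughly as follows. This is a sketch under stated assumptions: hypotheses are represented as dictionaries mapping points to probabilities, the samples are supplied by the caller rather than drawn inside the routine, $W_1$ is computed over the union of the two hypotheses' supports, and the return values `"h1"`, `"h2"` and `"draw"` are labels invented for this illustration.

```python
def choose_hypothesis(h1, h2, samples, eps):
    """Run the competition between h1 and h2 on samples drawn from p."""
    support = set(h1) | set(h2)
    # W1: points where h1 puts strictly more mass than h2
    w1 = {w for w in support if h1.get(w, 0.0) > h2.get(w, 0.0)}
    p1 = sum(h1.get(w, 0.0) for w in w1)
    q1 = sum(h2.get(w, 0.0) for w in w1)
    if p1 - q1 <= 5 * eps:          # Step 1: hypotheses too close to separate
        return "draw"
    # Step 2: fraction of samples that fall inside W1
    frac = sum(1 for s in samples if s in w1) / len(samples)
    if frac > p1 - 1.5 * eps:       # Step 3
        return "h1"
    if frac < q1 + 1.5 * eps:       # Step 4
        return "h2"
    return "draw"                   # Step 5
```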






It is not hard to check that the outcome of the competition does not depend on the ordering of the pair of
distributions provided in the input; that is, on inputs $(h_1, h_2)$ and $(h_2, h_1)$ the competition outputs the same
result for a fixed sequence of samples $s_1, \ldots, s_m$ drawn from $p$.

The correctness of Choose-Hypothesis is an immediate consequence of the following lemma.

Lemma 11 Suppose that $d_{TV}(p, h_1) \le \epsilon'$. Then:

(i) If $d_{TV}(p, h_2) > 6\epsilon'$, then the probability that the competition between $h_1$ and $h_2$ does not declare $h_1$ as
the winner is at most $e^{-m\epsilon'^2/2}$. (Intuitively, if $h_2$ is very bad then it is very likely that $h_1$ will be declared
winner.)

(ii) If $d_{TV}(p, h_2) > 4\epsilon'$, the probability that the competition between $h_1$ and $h_2$ declares $h_2$ as the winner is
at most $e^{-m\epsilon'^2/2}$. (Intuitively, if $h_2$ is only moderately bad then a draw is possible but it is very unlikely
that $h_2$ will be declared winner.)

Proof: Let $r = p(W_1)$. The definition of the total variation distance implies that $|r - p_1| \le \epsilon'$. Let us
define the 0/1 (indicator) random variables $\{Z_j\}_{j=1}^{m}$ as $Z_j = 1$ iff $s_j \in W_1$. Clearly, $\tau = \frac{1}{m}\sum_{j=1}^{m} Z_j$
and $\mathbf{E}[\tau] = \mathbf{E}[Z_j] = r$. Since the $Z_j$'s are mutually independent, it follows from the Chernoff bound that
$\Pr[\tau \le r - \epsilon'/2] \le e^{-m\epsilon'^2/2}$. Using $|r - p_1| \le \epsilon'$ we get that $\Pr[\tau \le p_1 - 3\epsilon'/2] \le e^{-m\epsilon'^2/2}$.

• For part (i): If $d_{TV}(p, h_2) > 6\epsilon'$, from the triangle inequality we get that $p_1 - q_1 = d_{TV}(h_1, h_2) > 5\epsilon'$.
Hence, the algorithm will go beyond Step 1, and with probability at least $1 - e^{-m\epsilon'^2/2}$, it will stop at Step
3, declaring $h_1$ as the winner of the competition between $h_1$ and $h_2$.

• For part (ii): If $p_1 - q_1 \le 5\epsilon'$ then the competition declares a draw, hence $h_2$ is not the winner. Otherwise
we have $p_1 - q_1 > 5\epsilon'$ and the above arguments imply that the competition between $h_1$ and $h_2$ will declare
$h_2$ as the winner with probability at most $e^{-m\epsilon'^2/2}$.

This concludes the proof of Lemma 11. ■

The proof of the theorem is now complete. ■



C Using the Hypothesis Tester 

In this section, we explain in detail how we use the hypothesis testing algorithm Choose-Hypothesis 
throughout this paper. In particular, the algorithm Choose-Hypothesis is used in the following places: 

• In Step 4 of algorithm Learn-kmodal-simple we need an algorithm $L^{\downarrow}_{\delta'}$ (resp. $L^{\uparrow}_{\delta'}$) that learns a
non-increasing (resp. non-decreasing) distribution within total variation distance $\epsilon$ and confidence $1 - \delta'$. Note
that the corresponding algorithms $L^{\downarrow}$ and $L^{\uparrow}$ provided by Theorem 3 have confidence 9/10. To boost the
confidence of $L^{\downarrow}$ (resp. $L^{\uparrow}$) we run the algorithm $O(\log(1/\delta'))$ times and use Choose-Hypothesis in
an appropriate tournament procedure to select among the candidate hypothesis distributions.

• In Step 5 of algorithm Learn-kmodal-simple we need to select among two candidate hypothesis 
distributions (with the promise that at least one of them is close to the true conditional distribution). In 
this case, we run Choose-Hypothesis once to select between the two candidates. 

• Also note that both algorithms Learn-kmodal-simple and Learn-kmodal generate an $\epsilon$-accurate
hypothesis with probability 9/10. We would like to boost the probability of success to $1 - \delta$. To achieve
this we again run the corresponding algorithm $O(\log(1/\delta))$ times and use Choose-Hypothesis in an
appropriate tournament to select among the candidate hypothesis distributions.

We now formally describe the "tournament" algorithm to boost the confidence to $1 - \delta$.

Lemma 12 Let $p$ be any distribution over a finite set $\mathcal{W}$. Suppose that $\mathcal{D}_\epsilon$ is a collection of $N$ distributions over
$\mathcal{W}$ such that there exists $q \in \mathcal{D}_\epsilon$ with $d_{TV}(p, q) \le \epsilon$. Then there is an algorithm that uses $O(\epsilon^{-2} \log N \log(1/\delta))$
samples from $p$ and with probability $1 - \delta$ outputs a distribution $p' \in \mathcal{D}_\epsilon$ that satisfies $d_{TV}(p, p') \le 6\epsilon$.

Devroye and Lugosi (Chapter 7 of [DL01]) prove a similar result by having all pairs of distributions in the
cover compete against each other using their notion of a competition, but again there are some small differences:
their approach chooses a distribution in the cover which wins the maximum number of competitions, whereas our
algorithm chooses a distribution that is never defeated (i.e. won or achieved a draw against all other distributions
in the cover). Instead we follow the approach from [DDS].

Proof: The algorithm performs a tournament by running the competition Choose-Hypothesis$^p(h_i, h_j, \epsilon, \delta/(2N))$
for every pair of distinct distributions $h_i, h_j$ in the collection $\mathcal{D}_\epsilon$. It outputs a distribution $q^* \in \mathcal{D}_\epsilon$ that
was never a loser (i.e. won or achieved a draw in all its competitions). If no such distribution exists in $\mathcal{D}_\epsilon$ then
the algorithm outputs "failure."

By definition, there exists some $q \in \mathcal{D}_\epsilon$ such that $d_{TV}(p, q) \le \epsilon$. We first argue that with high probability this
distribution $q$ never loses a competition against any other $q' \in \mathcal{D}_\epsilon$ (so the algorithm does not output "failure").
Consider any $q' \in \mathcal{D}_\epsilon$. If $d_{TV}(p, q') > 4\epsilon$, by Lemma 11(ii) the probability that $q$ loses to $q'$ is at most
$2e^{-m\epsilon^2/2} = O(1/N)$. On the other hand, if $d_{TV}(p, q') \le 4\epsilon$, the triangle inequality gives that $d_{TV}(q, q') \le 5\epsilon$
and thus $q$ draws against $q'$. A union bound over all $N$ distributions in $\mathcal{D}_\epsilon$ shows that with probability $1 - \delta/2$,
the distribution $q$ never loses a competition.

We next argue that with probability at least $1 - \delta/2$, every distribution $q' \in \mathcal{D}_\epsilon$ that never loses has small
variation distance from $p$. Fix a distribution $q'$ such that $d_{TV}(q', p) > 6\epsilon$; Lemma 11(i) implies that $q'$ loses
to $q$ with probability $1 - 2e^{-m\epsilon^2/2} \ge 1 - \delta/(2N)$. A union bound gives that with probability $1 - \delta/2$, every
distribution $q'$ that has $d_{TV}(q', p) > 6\epsilon$ loses some competition.

Thus, with overall probability at least $1 - \delta$, the tournament does not output "failure" and outputs some
distribution $q^*$ such that $d_{TV}(p, q^*)$ is at most $6\epsilon$. This proves the lemma. ■
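
The tournament in the proof above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the steps of the competition are reconstructed from the proof of Lemma 11 (draw if $p_1 - q_1 \le 5\epsilon$, otherwise compare the empirical mass $\tau$ of $\mathcal{W}_1$ with $p_1 - 3\epsilon/2$ and $q_1 + 3\epsilon/2$), the sample size uses the constant 2, and a single batch of samples is reused across all competitions. Hypotheses are represented as dictionaries mapping points to probabilities; the names `competition`, `tournament` and `draw` are hypothetical.

```python
import math
import random
from itertools import combinations


def competition(h1, h2, samples, eps):
    """Return 0 if h1 wins, 1 if h2 wins, None for a draw.

    h1, h2: dicts mapping domain points to probabilities.
    samples: list of draws from the unknown distribution p.
    """
    support = set(h1) | set(h2)
    # W1 is the set of points where h1 puts strictly more mass than h2.
    w1 = {w for w in support if h1.get(w, 0.0) > h2.get(w, 0.0)}
    p1 = sum(h1.get(w, 0.0) for w in w1)
    q1 = sum(h2.get(w, 0.0) for w in w1)
    if p1 - q1 <= 5 * eps:           # hypotheses too close: draw
        return None
    tau = sum(s in w1 for s in samples) / len(samples)
    if tau > p1 - 1.5 * eps:         # empirical mass agrees with h1
        return 0
    if tau < q1 + 1.5 * eps:         # empirical mass agrees with h2
        return 1
    return None


def tournament(hyps, draw, eps, delta):
    """Return a hypothesis that never lost, or None for "failure"."""
    n = len(hyps)
    # Assumed sample size: each competition gets confidence delta / (2n).
    m = math.ceil(2 * math.log(2 * n / delta) / eps ** 2)
    samples = [draw() for _ in range(m)]
    lost = [False] * n
    for i, j in combinations(range(n), 2):
        result = competition(hyps[i], hyps[j], samples, eps)
        if result == 0:
            lost[j] = True
        elif result == 1:
            lost[i] = True
    for i in range(n):
        if not lost[i]:
            return hyps[i]
    return None


# Toy usage: p is uniform on {0, 1}; only h_good is close to p.
rng = random.Random(0)
h_good = {0: 0.5, 1: 0.5}
h_bad_a = {0: 1.0}
h_bad_b = {1: 1.0}
winner = tournament([h_bad_a, h_good, h_bad_b],
                    lambda: rng.randrange(2), eps=0.05, delta=0.1)
```

In the toy run the two point-mass hypotheses draw against each other, but each loses to `h_good`, so `h_good` is the only never-defeated candidate.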

We now explain how the above lemma is used in our context: Suppose we perform $O(\log(1/\delta))$ runs of
a learning algorithm that constructs an $\epsilon$-accurate hypothesis with probability at least 9/10. Then, with failure
probability at most $\delta/2$, at least one of the hypotheses generated is $\epsilon$-close to the true distribution in variation
distance. Conditioning on this good event, we have a collection of distributions with cardinality $O(\log(1/\delta))$ that
satisfies the assumption of the lemma. Hence, using $O((1/\epsilon^2) \cdot \log\log(1/\delta) \cdot \log(1/\delta))$ samples we can learn
to accuracy $6\epsilon$ and confidence $1 - \delta/2$. The overall sample complexity is $O(\log(1/\delta))$ times the sample
complexity of the learning algorithm with confidence 9/10, plus this additional $O((1/\epsilon^2) \cdot \log\log(1/\delta) \cdot \log(1/\delta))$
term.
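
The accounting in this paragraph can be made concrete with a small calculation. The sketch below is illustrative only: every constant hidden in the $O(\cdot)$ notation is set to 1, which is an assumption, and `boosting_budget` and `base_samples` are hypothetical names for the helper and for the sample complexity of one run of the confidence-9/10 learner.

```python
import math


def boosting_budget(eps, delta, base_samples):
    """Return (runs, total_samples) for boosting confidence to 1 - delta.

    Constants hidden in the O() bounds are taken to be 1 (an assumption).
    """
    # Each run fails with probability at most 1/10, so all runs fail with
    # probability at most 10**(-runs); we want this to be at most delta / 2.
    runs = max(1, math.ceil(math.log(2 / delta) / math.log(10)))
    # Tournament over N = runs candidates with failure probability delta / 2:
    # (1 / eps^2) * log N * log(2 / delta), with log N floored at 1.
    tournament_samples = math.ceil(
        (1 / eps ** 2) * max(1.0, math.log(runs)) * math.log(2 / delta)
    )
    return runs, runs * base_samples + tournament_samples


runs, total = boosting_budget(eps=0.1, delta=0.01, base_samples=1000)
```

The first term of the total grows with the cost of the base learner, while the tournament term depends only on $\epsilon$ and $\delta$.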

In terms of running time, we make the following easily verifiable remarks: When the hypothesis testing
algorithm Choose-Hypothesis is run on a pair of distributions that are produced by Birgé's algorithm,
its running time is polynomial in the succinct description of these distributions, i.e. in $\log^2(n)/\epsilon$. Similarly,
when Choose-Hypothesis is run on a pair of outputs of Learn-kmodal-simple or Learn-kmodal,
its running time is polynomial in the succinct description of these distributions. More specifically, in the
former case, the succinct description has bit complexity $O(k \cdot \log^2(n)/\epsilon^2)$ (since the output consists of $O(k/\epsilon)$
monotone intervals, and the conditional distribution on each interval is the output of Birgé's algorithm for
that interval). In the latter case, the succinct description has bit complexity $O(k \cdot \log^2(n)/\epsilon)$, since the
algorithm Learn-kmodal constructs only $k$ monotone intervals. Hence, in both cases, each execution of
the testing algorithm performs $\mathrm{poly}(k, \log n, 1/\epsilon)$ bit operations. Since the tournament invokes the algorithm
Choose-Hypothesis $O(\log^2(1/\delta))$ times (once for every pair of distributions in our pool of $O(\log(1/\delta))$
candidates), the upper bound on the running time follows.